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Preface 


This volume summarizes the research done and results obtained in the second 
funding phase of the Priority Program 1648 “Software for Exascale Computing” 
(SPPEXA) of the German Research Foundation (DFG). In that respect, it both 
provides an overview of SPPEXA’s achievements and represents a continuation 
of Vol. 113 in Springer’s series “Lecture Notes in Computational Science and 
Engineering”, the corresponding report of SPPEXA’s first funding phase. 

For some general remarks on the uniqueness of SPPEXA—as the first strategic, 
i.e. board-initiated Priority Program of DFG; as the first tri-national Priority 
Program with synchronized collaborative research in Germany, France, and Japan; 
as a multi-disciplinary endeavor involving informatics and mathematics, but also 
various fields from engineering, the sciences, and the life sciences; and as the first 
holistic approach to research on High-Performance Computing (HPC) software 
at the level of fundamental research—we refer to the overview contribution of 
Bungartz et al. (see chapter “Software for Exascale Computing: Some Remarks 
on the Priority Program SPPEXA”) in this volume. There, also some statistics are 
provided. 

The spirit of the international collaboration, whether in a bi-lateral (German— 
Japanese) or in a tri-lateral (French-Japanese—German) setting, can be found and 
felt in several of the reports of 16 out of 17 SPPEXA consortia. This structured 
and institutionalized collaboration was not easy to establish, and we are grateful 
for the shared enthusiasm, commitment, and support of the three involved funding 
agencies: the German Research Foundation (DFG), the Agence Nationale de la 
Recherche (ANR), and the Japan Science and Technology Agency (JST). The 
synergies emerging from bringing together the expertise of groups from three 
countries did not only boost the respective project work itself, it also prepared 
the ground for ongoing partnerships as well as for a topical extension towards the 
interplay of HPC and Artificial Intelligence—a field that both benefits tremendously 
from HPC and, at the same time, fosters HPC with new concepts. 

As always, many people helped to make SPPEXA in general and this volume 
in particular a great success. Concerning the first, our thanks go to the agencies 
already mentioned and their responsible officers; then to all the SPPEXA researchers 
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in 17 consortia who made SPPEXA such a wonderful and productive research 
experience; and finally to all helping hands that supported SPPEXA in terms of 
organizing and hosting events such as workshops, doctoral retreats, minisymposia, 
gender workshops, annual plenary meetings, and so forth. Moreover, concerning the 
preparation of this volume, we are grateful to Dr. Martin Peters and Leonie Kunz 
from Springer for their support—as in previous cases, it was again a pleasure to 
collaborate. Finally, we thank Mirco Troue, Tina Angerer, and Michael Obersteiner 
for their support in proofreading and compiling this book. 

The first exascale systems are expected to be available in about one year. For sure, 
there is still a lot of work to be done to let cutting-edge science applications fully 
exploit their potential. However, we are fully convinced that SPPEXA contributed 
significantly to pave the way towards exascale computers and their usage. 


Garching, Germany Hans-Joachim Bungartz 
Garching, Germany Severin Reiz 
Eindhoven, Netherlands Benjamin Uekermann 
Hamburg, Germany Philipp Neumann 


Dresden, Germany Wolfgang E. Nagel 
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Part I 
SPPEXA: The Priority Program 


Software for Exascale Computing: Some A 
Remarks on the Priority Program gii 
SPPEXA 


Hans-Joachim Bungartz, Wolfgang E. Nagel, Philipp Neumann, Severin Reiz, 
and Benjamin Uekermann 


Abstract SPPEXA, the Priority Program 1648 “Software for Exa-scale Comput- 
ing” of the German Research Foundation (DFG), was established in 2012. SPPEXA 
was DFG’s first strategic Priority Program—strategic in the sense that it had been 
the initiative of DFG’s board to suggest a larger and trans-disciplinary funding 
scheme to support the development of software at all levels that would be able 
to benefit from future exa-scale systems. A proposal had been formulated by a 
team of scientists representing domains across the STEM fields, evaluated in the 
standard format for Priority Programs, and financed via special funds. Operations 
started in January 2013, and after two 3-year funding phases and a cost-neutral 
extension, SPPEXA’s activities will come to an end by end of April, 2020. A final 
international symposium took place on October 21-23, 2019, in Dresden, and this 
volume of Springer’s Lecture Notes in Computational Science and Engineering— 
the second SPPEXA-related one after the corresponding report of Phase 1 (see 
Appendix 3 in [1])—contains reports of 16 out of 17 SPPEXA projects (the project 
ExaSolvers will deliver its report as a special issue of Springer’s journal Computing 
and Visualization in Science) and is, thus, a comprehensive overview of research 
within SPPEXA. 

While each single project report emphasizes the respective project’s individual 
research outcomes and, thus, provides one perspective of research in SPPEXA, 
this contribution, co-authored by the two scientific coordinators—Hans-Joachim 


H.-J. Bungartz - S. Reiz (%4) 
Technical University of Munich, Garching, Germany 
e-mail: bungartz @in.tum.de; reiz@in.tum.de 


W. E. Nagel 
Technical University of Dresden, Dresden, Germany 


P. Neumann 
Helmut-Schmidt-Universität Hamburg, Hamburg, Germany 
e-mail: philipp.neumann @hsu-hh.de 


B. Uekermann 
Eindhoven University of Technology, Eindhoven, Netherlands 
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Bungartz and Wolfgang E. Nagel—and by three of the four researchers that 
have served as program coordinator over the years—Philipp Neumann, Benjamin 
Uekermann, and Severin Reiz—emphasizes the program SPPEXA itself. It provides 
an overview of the design and implementation of SPPEXA, it highlights its 
accompanying and supporting activities (internationalization, in particular with 
France and Japan; workshops; doctoral retreats; diversity-related measures), and it 
provides some statistics. It, thus, complements the papers from SPPEXA’s research 
consortia collected in this volume. 


1 Preparation 


While supercomputers were recognized early as an important research infrastructure 
for German science and have been since then on the agenda (recommendations of 
the German Science Council (Wissenschaftsrat), introduction of the performance 
pyramid, Gauss Centre for Supercomputing, Gauss Alliance, NHR—Nationales 
Hochleistungsrechnen), the situation for supercomputing has always been quite 
different. First, the funds for HPC systems are typically limited to investments, 
i.e. the machinery; the current NHR initiative takes a more comprehensive view. 
Second, software development is frequently not considered as “science”, which 
entails that neither typical projects in informatics or mathematics nor their coun- 
terparts in fields of application cover more than prototype development. Recently, 
BMBF’s HPC software program and DFG’s sustainable scientific software initiative, 
fortunately, have acknowledged the crucial role of software for HPC and support 
software development explicitly. Third, HPC software development has happened in 
Collaborative Research Centers or similar formats before, but mostly in an isolated 
way: an informatics initiative contained an HPC software project as an application, 
or a physics initiative contained a simulation- or HPC-oriented project. But all this 
hardly ever looked at more than one peculiar aspect at a time, and it was at most an 
interdisciplinary endeavor of two fields. 

However, when Moore’s law at least gets exhausted a bit and performance 
gains are more and more achievable through a more and more massive parallelism 
only, it is obvious that software and its performance and scalability play an 
increasingly crucial part. Therefore, the challenges at the eve of the exa-scale era 
required more—and that’s actually what happened elsewhere, for example in the 
U.S. or in Japan: a significant, concerted initiative, bringing together informatics, 
mathematics, and several domains of application, comprising all relevant aspects of 
HPC software. That’s where SPPEXA entered the stage. 


2 Design Principles 


SPPEXA was designed to provide a holistic approach to HPC software, comprising 
the aspects most relevant for ensuring the efficient use of current and upcoming 
high-end supercomputers, and to do this via exploring both evolutionary and 
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disruptive research threads. Six research directions were identified as crucial ones: 
(1) Computational Algorithms, (2) Application Software, (3) System Software and 
Runtime Libraries, (4) Programming, (5) Software Tools, and (6) Data Manage- 
ment. Computational algorithms, such as fast linear solvers or eigensolvers, are a 
core numerical component of many large-scale application codes—both classical 
simulation-driven and recent data analytics-oriented ones. If scalability cannot be 
ensured here, the battle is already almost lost. Application software is the “user” of 
HPC systems, typically appearing as legacy codes that have been developed over 
many years. Increasing their performance via a co-design that addresses both the 
“systems—algorithms” and the “‘algorithms—applications/models” interfaces and 
combines algorithm and performance engineering is vital. Performance engineering 
can’t succeed without progress in compilers, monitoring, code optimization, verifi- 
cation support, and parallelization support (such as auto-tuning)—which underlines 
the importance of system software and runtime libraries as well as of tools. 
Programming, including programming models, is probably the topic where the need 
for a balance of evolutionary research (improve and extend existing programming 
models, e.g.) and revolutionary approaches (explore new programming models, 
new language concepts such as Domain-Specific Languages) gets most obvious. 
Data management, finally, has always been HPC-relevant in terms of I/O or post- 
processing and visualization, and it is of ever-increasing importance since more and 
more HPC applications are on the data side. 

To ensure the impact of this holistic idea, it was clear that having a set of 
projects in our Priority Program where some address this issue and others that one, 
and where they may collaborate or not, would not suffice. Therefore, SPPEXA’s 
concept was to have a set of larger projects, or project consortia (research units— 
Forschergruppen), that would all have to address at least two of the six big topics 
with their research agenda; and that would all have to combine a relevant large-scale 
application with HPC-methodical advancements. This means that neither a merely 
domain-driven research (“improve my code, and this is a contribution to HPC in 
itself”), as we see it frequently in domain-driven research initiatives (Collaborative 
Research Centers in physics, life sciences, or engineering, e.g.), nor a generic purely 
algorithmic research (“if I improve my solver, this will help everyone”), as we see 
it frequently in mathematics- or informatics-driven research initiatives, would be 
allowed to find their place in SPPEXA. This was somewhat challenging, since we 
had to communicate this concept clearly and to convince potential applicants and 
reviewers that everyone should really comply with this agenda. 

Furthermore, there is one property better known from Collaborative Research 
Centers than from Priority Programs: program-wide joint activities. For example, 
we wanted to have a vivid collaboration framework of cross-project workshops; 
networking with the big international programs; a focus on education also through 
fostering novel teaching formats or coding weeks and doctoral retreats for the 
doctoral candidates; gender-related activities to understand, evaluate and work 
towards a more gender-balanced research community; etc. This allowed for sharing 
mutual best practices in HPC for the mathematics- or informatics- or application- 
driven areas. Therefore, there was more coordination than we see in typical Priority 
Programs. 
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3 Funded Projects and Internal Structure 


In the first funding phase, the following thirteen projects or project consortia were 
funded:! 


CATWALK—A Quick Development Path for Performance Models. Felix Wolf 
(Darmstadt), Christian Bischof (Darmstadt), Torsten Hoefler (Zürich), Bernd Mohr (Jülich), 
and Gabriel Wittum (Frankfurt) 


ESSEX—Equipping Sparse Solvers for Exa-scale. Gerhard Wellein (Erlangen), Achim 
Basermann (Köln), Holger Fehske (Greifswald), Georg Hager (Erlangen), and Bruno Lang 
(Wuppertal) 

Exa-Dune—Flexible PDE Solvers, Numerical Methods, and Applications. Peter Bastian 
(Heidelberg), Olaf Ippisch (Clausthal), Mario Ohlberger (Miinster), Christian Engwer 
(Münster), Stefan Turek (Dortmund), Dominik Géddeke (Stuttgart), and Oleg Iliev 
(Kaiserslautern) 

ExaFSA—Exa-scale Simulation of Fluid-Structure-Acoustics Interactions. Miriam Mehl 
(Stuttgart), Hester Bijl (Delft), Sabine Roller (Siegen), Dörte Sternel (Darmstadt), and Thomas 
Ertl (Stuttgart) 

EXAHD—An Exa-Scalable 2-Level Sparse Grid Approach for Higher-Dimensional 
Problems in Plasma Physics and Beyond. Dirk Pfliiger (Stuttgart), Hans-Joachim Bungartz 
(Miinchen), Michael Griebel (Bonn), Markus Hegland (Canberra), Frank Jenko (Garching), 
and Hermann Lederer (Garching) 

EXAMAG—Fxa-scale Simulations of the Evolution of the Universe Including Magnetic 
Fields. Volker Springel (Heidelberg) and Christian Klingenberg (Wiirzburg) 
ExaSolvers—Extreme-scale Solvers for Coupled Problems. Lars Grasedyck (Aachen), 
Wolfgang Hackbusch (Leipzig), Rolf Krause (Lugano), Michael Resch (Stuttgart), Volker 
Schulz (Trier), and Gabriel Wittum (Frankfurt) 

EXASTEEL—Bridging Scales for Multiphase Steels. Daniel Balzani (Bochum), Axel 
Klawonn (Köln), Oliver Rheinbach (Freiberg), Jorg Schröder (Duisburg-Essen), and Gerhard 
Wellein (Erlangen) 

ExaStencils—Advanced Stencil-Code Engineering. Christian Lengauer (Passau), Armin 
Größlinger (Passau), Ulrich Rüde (Erlangen), Harald Kostler (Erlangen), Sven Apel 
(Saarbriicken), Jiirgen Teich (Erlangen), Frank Hannig (Erlangen), and Matthias Bolten 
(Wuppertal) 

FFMK—A Fast and Fault-tolerant Microkernel-Based System for Exa-scale Computing. 
Hermann Härtig (Dresden), Alexander Reinefeld (Berlin), Amnon Barak (Jerusalem), and 
Wolfgang E. Nagel (Dresden) 

GROMEX—Unified Long-range Electrostatics and Dynamic Protonation for Realistic 
Biomolecular Simulations on the Exa-scale. Helmut Grubmiiller (Göttingen), Holger 
Dachsel (Jiilich), and Berk Hess (Stockholm) 

DASH—Smart Data Structures and Algorithms with Support for Hierarchical Locality. 
Karl Fiirlinger (München), Colin W. Glass (Stuttgart), José Gracia (Stuttgart), and Andreas 
Kniipfer (Dresden) 

Terra-Neo—Integrated Co-Design of an Exa-scale Earth Mantle Modeling Framework. 
Hans-Peter Bunge (München), Ulrich Rüde (Erlangen), Gerhard Wellein (Erlangen), and 
Barbara Wohlmuth (München) 


1 Some Principal Investigators have changed affiliation during the SPPEXA program. We specified 
the most recent main affiliation here. 
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After 3 years, twelve of those got a prolongation for the second funding phase, 
some with an “international extension” (bi-national with Japanese partners or tri- 
national with French and Japanese partners): 


ESSEX-2—Equipping Sparse Solvers for Exa-scale. Gerhard Wellein (Erlangen), Achim 
Basermann (Köln), Holger Fehske (Greifswald), Georg Hager (Erlangen), Bruno Lang 
(Wuppertal), Tetsuya Sakurai (Tsukuba; Japanese partner), and Kengo Nakajima (Tokyo; 
Japanese partner) 


Exa-Dune—Flexible PDE Solvers, Numerical Methods, and Applications. Peter Bastian 
(Heidelberg), Olaf Ippisch (Clausthal), Mario Ohlberger (Miinster), Christian Engwer 
(Münster), Stefan Turek (Dortmund), Dominik Géddeke (Stuttgart), and Oleg Iliev 
(Kaiserslautern) 


ExaFSA—Exa-scale Simulation of Fluid-Structure-Acoustics Interactions. Miriam Mehl 
(Stuttgart), Alexander van Zuijlen (Delft), Thomas Ertl (Stuttgart), Sabine Roller (Siegen), 
Dorte Sternel (Darmstadt), and Hiroyuki Takizawa (Tohoku; Japanese partner) 


EXAHD—An Exa-Scalable 2-Level Sparse Grid Approach for Higher-Dimensional 
Problems in Plasma Physics and Beyond. Dirk Pflüger (Stuttgart), Hans-Joachim Bungartz 
(Miinchen), Michael Griebel (Bonn), Markus Hegland (Canberra), Frank Jenko (Garching), 
and Tilman Dannert (Garching) 


EXAMAG—FExa-scale Simulations of the Magnetic Universe. Volker Springel 
(Heidelberg), Christian Klingenberg (Wiirzburg), Naoki Yoshida (Tokyo; Japanese partner), 
and Philippe Helluy (Strasbourg; French partner) 


ExaSolvers—Extreme-scale Solvers for Coupeld Problems. Lars Grasedyck (Aachen), Rolf 
Krause (Lugano), Michael Resch (Stuttgart), Volker Schulz (Trier), Gabriel Wittum 
(Frankfurt), Arne Nagel (Frankfurt), Hiroshi Kawai (Tokyo; Japanese partner), and Ryuji 
Shioya (Toyo; Japanese partner) 

EXASTEEL-2—Dual Phase Steels—From Micro to Macro Properties. Daniel Balzani 
(Bochum), Axel Klawonn (Koln), Oliver Rheinbach (Freiberg), Jorg Schröder 
(Duisburg-Essen), Olaf Schenk (Lugano), and Gerhard Wellein (Erlangen) 


ExaStencils—Advanced Stencil-Code Engineering. Christian Lengauer (Passau), Ulrich 
Rüde (Erlangen), Harald Köstler (Erlangen), Sven Apel (Saarbrücken), Jürgen Teich 
(Erlangen), Frank Hannig (Erlangen), Matthias Bolten (Wuppertal), and Shigeru Chiba (Tokyo; 
Japanese partner) 


FFMK—A Fast and Fault-tolerant Microkernel-Based System for Exa-scale Computing. 
Hermann Hiartig (Dresden), Alexander Reinefeld (Berlin), Amnon Barak (Jerusalem), and 
Wolfgang E. Nagel (Dresden) 


GROMEX—Unified Long-range Electrostatics and Dynamic Protonation for Realistic 
Biomolecular Simulations on the Exa-scale. Helmut Grubmiiller (Göttingen), Holger 
Dachsel (Jiilich), and Berk Hess (Stockholm) 

DASH—Smart Data Structures and Algorithms with Support for Hierarchical Locality. 
Karl Fiirlinger (München), Colin W. Glass (Stuttgart), José Gracia (Stuttgart), and Andreas 
Kniipfer (Dresden) 

Terra-Neo—Integrated Co-Design of an Exa-scale Earth Mantle Modeling Framework. 
Hans-Peter Bunge (München), Ulrich Rüde (Erlangen), and Barbara Wohlmuth (München) 
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Furthermore, four new project consortia joined SPPEXA: 


ADA-FS—Advanced Data Placement via Ad-hoc File Systems at Extreme Scales. 
Wolfgang E. Nagel (Dresden), André Brinkmann (Mainz), and Achim Streit (Karlsruhe) 


AIMES—Advanced Computation and I/O Methods for Earth-System Simulations. 
Thomas Ludwig (Hamburg), Thomas Dubos (Versailles; French partner), Naoya Maruyama 
(RIKEN; Japanese partner), and Takayuki Aoki (Tokyo; Japanese partner) 


ExaDG—High-order Discontinuous Galerkin for the Exa-scale. Guido Kanschat 
(Heidelberg), Katharina Kormann (Miinchen), Martin Kronbichler (Miinchen), and Wolfgang 
A. Wall (Miinchen) 


MYX—MUST Correctness Checking for YML and XMP Programs. Matthias S. Miiller 
(Aachen), Serge Petiton (Lille; French partner), Nahid Emad (Versailles; French partner), 
Taisuke Boku (Tsukuba; Japanese partner), and Hitoshi Murai (RIKEN; Japanese partner) 


Finally, 1 year later, a seventeenth project joined SPPEXA as associated project: 


ExtraPeak—Automatic Performance Modeling of HPC Applications. Felix Wolf 
(Darmstadt) and Torsten Hoefler (Ziirich) 


Hence, overall, there have been four Japanese-German and three French- 
Japanese-German consortia within SPPEXA. On the German side, an overall sum 
of 57 principal investigators from 39 institutions have been involved, representing 
informatics (25), mathematics (19), engineering (8), natural sciences (4), and life 
sciences (1). 

Concerning governance, SPPEXA was headed by its two Spokespersons Hans- 
Joachim Bungartz (Technical University of Munich—TUM) and Wolfgang E. 
Nagel (Technical University of Dresden). For the everyday organization, a Program 
Coordinator (in chronological order: Benjamin Peherstorfer, now professor at New 
York University; Philipp Neumann, now professor at Helmut-Schmidt-University 
Hamburg; Benjamin Uekermann, now with Eindhoven University of Technology; 
and Severin Reiz, TUM) as well as an Office were established (both at TUM). 
Strategic decisions in SPPEXA were taken by the Steering Committee, consisting of 
H.-J. Bungartz, W. E. Nagel, as well as Sabine Roller (Siegen), Christian Lengauer 
(Passau), Hans-Peter Bunge (München), Dörte Sternel (Darmstadt), and—in the 
second funding phase—Nahid Emad (France) and Takayuki Aoki (Japan). Finally, 
a Scientific Advisory Board supported our activities and planning: George Biros 
(University of Texas at Austin), Rupak Biswas (NASA), Klaus Becker (Airbus), 
Rob Schreiber (at that time HP Labs), and Craig Stewart (University of Indiana at 
Bloomington). 
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4 SPPEXA Goes International 


Extreme-scale HPC has always been an international endeavor. In 2010, as the 
first call in the framework of the G8 Research Councils’ Initiative on Multilateral 
Research Funding, the topic Application Software towards Exa-scale Computing 
for Global Scale Issues had been selected. In the sequel of that initiative, the idea 
arose to give SPPEXA in its second funding phase a more international flavor, 
beyond the individual international partners present in some of the consortia. DFG’s 
head office contacted several of their partner institutions in other countries. While 
it turned out to be complicated to synchronize activities with the National Science 
Foundation (NSF) in the U.S., the discussions with the French Agence Nationale 
de la Recherche (ANR) and the Japan Science and Technology Agency (JST) 
became very concrete. Finally, for the first time, a funding phase of a complete 
DFG Priority Program was linked to funding formats from two other countries, and 
the three agencies combined their forces in a joint call run by DFG. Due to formal 
restrictions, two new types of SPPEXA consortia were open for application: bi- 
national Japanese-German or tri-national French-Japanese-German ones. 

Overall, the following French institutions participated in SPPEXA projects: 
Université de Versailles, Université de Strasbourg, and Maison de la Simulation, 
Saclay. From the Japanese side, the involved partner institutions involved were 
RIKEN, Tokyo University of Technology, University of Tsukuba, University of 
Tokyo, Tohoku University, Tokyo University of Science, and Toyo University. 
Beyond research in the single consortia, one SPPEXA doctoral retreat was held in 
France, and SPPEXA co-organized three French-Japanese-German workshops—the 
first one 2017 in the French embassy in Tokyo, the second one in 2018 in the German 
embassy in Tokyo, and the third one in 2019, again in the French embassy. The 
first two focused on exa-scale computing, while the third one did a move towards 
artificial intelligence (AJ) and, in particular, addressed the convergence of AI and 
HPC. 

Further internationalization measures were the SPPEXA guest program, the 
research stays for doctoral candidates (up to 3 months; overall 25 taken in funding 
phase 2), and our PR activities at the big international meetings. For example, 
SPPEXA organized panels or sessions at the Supercomputing Conference (SC) and 
the International Supercomputing Conference (ISC HPC) and participated in the 
session and poster exhibition on DFG-funded collaborative research at DATE 2019. 


5 Joint Coordinated Activities 


As mentioned above, SPPEXA featured a rich program of joint cross-consortium 
activities (the following numbers refer to funding phase 2, 2016-2019): 


Guests Overall, more than 85 guest researchers visited one or more SPPEXA 
projects. 
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Workshops Workshops were a particular format to foster exchange and collab- 
oration across project consortia. Central funds had been established for that, and 
each SPPEXA PI could hand in proposals (two calls per year). The proposal had to 
depict how the cross-consortium effect was to be ensured (more than one organizing 
consortium, etc.). Overall, 41 SPPEXA workshops, held at conferences or stand- 
alone, were supported via this channel. 


Doctoral Retreats The SPPEXA Doctoral Retreat had two main goals—first, to 
offer an additional educational component to our doctoral candidates; second, to 
overcome the sometimes narrow borders of research by connecting with interna- 
tional researchers on a doctoral level (guest lectures, own contributions, hands-on 
sessions, ...). Overall, three doctoral retreats were organized: Strasbourg (2016), 
Dresden (2017), and Wuppertal (2018). 


Doctoral Research Stays Following the successful model of TUM Graduate 
School, where each doctoral candidate university-wide can get funds for an 
international research stay of up to 3 months, we encouraged our doctoral candidates 
SPPEXA-wide to enrich their PhD phase with such an international component. 
Overall, 25 such research interns were funded, examples for destinations being ETH 
Zurich, NORCE Bergen, or University of Tennessee. 


Gender Activities Looking at the gender situation in HPC, it is obvious that 
the presence of women is even worse than in general in informatics. To improve 
that situation and to provide a more open atmosphere, a couple of measures 
were taken. At every Annual Plenary Meeting (2016, 2017, 2018, and 2019), we 
organized gender trainings by external coaches to raise awareness of gender biases 
in academia, each with 25 participants. Additionally, SPPEXA members organized 
workshop-like events such as student MINT mentoring days (2016-2018) and 
women’s networking events in 2019. Moreover, we connected to industry (Bosch 
and IBM) via gender bias discussion days called “Equality at Exascale”. Exceptional 
at this event was that not only women participated, but we had an ideal gender-parity 
in participants. 


Impact on Education As a side effect, HPC education also got a boost by 
SPPEXA. Numerous lectures and lab courses were updated, and a lot of student 
theses had topics directly related to SPPEXA projects. 


Prizes During the second phase of SPPEXA, every year, the best student and 
doctoral theses SPPEXA-wide were awarded a prize. Over the years, the winners 
were: 


2016: Klaudius Scheufele (Stuttgart, master’s thesis) and Benjamin Uekermann 
(Munich, PhD thesis); 

2017: Sebastian Schweikl (Passau, bachelor’s thesis), Simon Schwitanski (Aachen, 
master’s thesis), and Moritz Kreutzer (Erlangen, PhD thesis); 

2018/2019: Piet Jarmatz (Munich/Hamburg, master’s thesis) and Sebastian Kuckuk 
and Christian Schmitt (Erlangen, PhD thesis). 
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Support of Young Researchers For sustainability in academia, supporting young 
aspiring researchers is indispensable. We took measures by funding research stays 
for doctoral candidates and awarding prizes for exceptional theses. Additionally, we 
also supported bachelor and master students for the student cluster competition at 
the (international) supercomputing conferences SC and ISC HPC 2016-2019. 


Public Relations Dissemination of research becomes more and more important. 
Continuing efforts from the first phase, SPPEXA featured articles in the InSiDE 
magazine, published by the GAUSS Center for Supercomputing, twice per year in 
2016, 2017, and 2018 introducing one project each time. Furthermore, starting 2018, 
SPPEXA contributed five articles to the online platform Science Node.” Last, in 
2018, SPPEXA also featured an article in the EU Research magazine. 


Internationalisation See previous Sect. 4. 


6 HPC Goes Data 


The computational revolution goes on! Computers and sophisticated computational 
methods have shaped the “third paradigm”, the third path to insight in science, 
complementing the classical approaches, theory and experiment, but also building 
a bridge and providing the missing link between those two. An early incarna- 
tion of “computational” were numerical simulations, later expanded by so-called 
“outer-loop scenarios”, in which repeated simulations allow for enhanced results: 
optimization, parameter identification, stochastics, or uncertainty quantification. 
All of this, basically, was model-driven, following a deductive regime of model 
hypotheses and derivations from them. The latest appearance of “computational” 
can be characterized by the focus on data: data-enhanced simulation, data analytics, 
machine learning, or artificial intelligence. Instead of being based on models, this 
approach is much more data-driven, following an inductive regime of collecting 
data and drawing conclusions from them. In simplified words, the “data from 
models” turned into, or was complemented by, a “models from data”. Despite 
that shift of focus, the basic underlying principle did not change: state-of-the-art 
computer systems and state-of-the-art computational methods are combined and 
used to advance the frontier of science. Something new is maybe the fact that the 
club of scientific domains that benefit from the “third paradigm” has become bigger: 
While numerical simulation was, more or less, driven by natural, engineering, and 
life sciences, the data-centered approach comprises all domains, including social 
sciences and humanities. 

Of course, this development has a huge impact on HPC. In particular, new fields 
and new types of applications popped up, as well as new lines of architectures 
and systems. For example, in 2018, the majority of finalists for the Gordon Bell 


*https://sciencenode.org/feature/the-race-to-exascale.php. 
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Award, the most renowned prize in HPC, already had a significant amount of 
machine learning in their papers. World-wide, HPC centers observe an increasing 
share of data-driven jobs on their machines. This is not surprising: as science 
and science methodology evolve, the kind of studies done in that context also 
does. Despite all those changes, the role of HPC is astonishingly stable: HPC is 
a core enabling technology of “computational”. It was and still is an enabler of 
numerical simulation, and it has become a crucial enabler of data analytics and 
artificial intelligence. If artificial intelligence, machine learning, or deep learning 
have become so popular recently, this is much more due to the fact that established 
methodology can succeed due to HPC, than due to new AI/ML/DL methodology 
itself. 

These developments are also visible at the end of SPPEXA. Several consortia 
already are on that “data-driven track”, as, for example, our third French-Japanese- 
German workshop in Tokyo showed. 


7 Shaping the Landscape 


When SPPEXA started in 2013, the core idea was to significantly improve algo- 
rithms, software, and tools, in order to be prepared for the exa-scale age. In the 
meantime, we are at the eve of exa-scale systems, as the co-design developments 
in the U.S. and in Japan (Fugaku) or the discussions in the European Union on 
exa-scale and pre-exa-scale systems show. And research in SPPEXA has definitely 
contributed to the application landscape in Germany being much closer to “exa- 
scale-readiness” than before. Several leading application software packages were 
involved, and significant progress in terms of scalability and parallel efficiency could 
be achieved. Furthermore, and maybe even more important, the SPPEXA consortia 
showed the advantages of the multi-disciplinary engagement, brought together a lot 
of groups and ideas disconnected before, and, thus, justified the concept of larger, 
cross-institutional, and cross-disciplinary teams instead of single-PI projects. 

The visibility SPPEXA got is stunning. SPPEXA was present at the leading 
international conferences (Euro-Par, Supercomputing, ISC HPC)—through individ- 
ual presentations and special events, such as minisymposia or panels. But also at 
“neighboring” events, such as the DATE 2019 (Design, Automation, and Test in 
Europe), SPPEXA had a presentation slot and a booth. SPPEXA was involved in 
the activities (workshops, white papers, etc.) of the BDEC Community (Big Data 
and Extreme-Scale Computing) as well as in the organization of the Long Program 
“Science at Extreme Scales: Where Big Data Meets Large-scale Computing” at the 
Institute for Pure and Applied Mathematics (IPAM) in Los Angeles, and it co- 
organized a French-Japanese-German workshop series in Tokyo (cf. the section 
on internationalization). Thus, at an international scale, SPPEXA was generally 
perceived as the “German player” in the HPC software concert. 
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8 Concluding Remarks 


Without any doubt, SPPEXA has written a success story: in terms of its research, 
concerning the innovative funding format, with its multi-disciplinary approach, its 
multi-national facets, and—last, but not least—its huge visibility. We are grateful for 
all the support we got from the German Research Foundation (DFG): the funding, 
but also for the encouragement during the preparation of SPPEXA and the continued 
advice during its runtime. 


Appendix 1: Qualification 


The following achievements have been completed in the SPPEXA program within 
1.1.2016 and 30.04.2020: 


Projects Completed PhD theses | Completed habilitations | Calls to professorship 
AMES | Od 


aas A o ë boo = ë ë To 
past h bo Cd 
ESSEX 0 
eas Ja ooo 
EDan |a po o o ë fi 
masa Jo o bo ë o= ë ë To 
exa J3 o= bho o= ë T 
EXAMAG Jo o ë boo ëo 
BuSoivers |1 ë bho ä ë o=ăć ë f 
exasreer |2 = bo ë o= ë Ni 
ExaStenci 4 
ExtaPeak E A E, 
FMK E a S 
œomex 2 — bp — bo 
wo oo e ëa 
Temaro J3 þh = To 
Coordination 2 
Overall 2 


The previous table follows the DFG requirements for final reports in priority 
programs. At least 25 additional PhD candidates are close to being finished; 
however, due to the lengthy defense procedure they are not counted here. 

Also, please take into account that project consortia vary in size (regarding 
Principal Investigators and PhD candidates) and their start/end date. 
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Appendix 2: Software from Project Consortia 


In the following, a table with links to software that has been developed by the project 
consortia in SPPEXA Phase-II is given. 


Project Software developed 
AIMES SCIL 
github.com/JulianKunkel/scil 
ADA-FS GekkoFS 
tu-dresden.de/zih/forschung/projekte/ada- fs 
DASH DASH 
www.dash-project.org/ 
ESSEX PHIST, GHOST, CRAFT, RACE, ScaMaC 
bitbucket.org/essex/{PHIST, ..., RACE, matrixcollection} 
ExaDG deal. 
github.com/dealii/dealii 


EXA-Dune DUNE 
gitlab.dune-project.org/exadune 
ExaFSA preCICE 
github.com/precice/precice 
Ateles 
apes.osdn.io/pages/ateles 
EXAHD SG++ 
github.com/SGpp/SGpp 
EXAMAG AREPO 
arepo-code.org/ 
ExaSolvers utopia 
bitbucket.org/zulianp/utopia/src/master/ 
EXASTEEL FE2TI 
www.numerik.uni-koeln.de/14079.html 
ExaStencils LFA Lab 
hrittich. github.io/lfa-lab/ 
ExaSlang 
i10git.cs.fau.de/exastencils/release 
ExtraPeak Extra-P 
www.scalasca.org/scalasca/software/extra- p/ 
FFMK FFMK 
ffmk.tudos.org/ 
GROMEX GROMACS 
www.gromacs.org 
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MYX MUST 

doc. itc.rwth-aachen.de/display/CCP/Projectt+MUST 
Terra-Neo HyTeG 

i 10git.cs.fau.de/hyteg/hyteg 

waLBerla 

www. walberla.net/ 

TerraNeo 


terraneo.fau.de/ 


Appendix 3: Project Consortia Key Publications 


This volume represents a continuation of the corresponding report in SPPEXA 
Phase-I, which is referenced several times in the text above: 


1. Bungartz, H.-J., Neumann, P., Nagel, W.E.: Software for Exascale Computing- 
SPPEXA 2013-2015, vol. 113. Springer, Berlin (2016) 


SPPEXA Phase-II showed visibility in the research community with numerous 
publications. In the following we provide a list of two key publications for each 
project consortium:* 


AIMES 

1. Jum’ah, N., Kunkel, J.: Performance portability of earth system models with 
user-controlled GGDML code translation. In: International Conference on High 
Performance Computing, pp. 693-710. Springer, Berlin (2018) 

2. Kunkel, J., Novikova, A., Betke, E., Schaare, A.: Toward decoupling the 
selection of compression algorithms from quality constraints. In: International 
Conference on High Performance Computing, pp. 3—14. Springer, Berlin (2017) 


ADA-FS 

1. Vef, M.A., Moti, N., Süß, T., Tocci, T., Nou, R., Miranda, A., Cortes, T., 
Brinkmann, A.: GekkoFS—a temporary distributed file system for HPC appli- 
cations. In: 2018 IEEE International Conference on Cluster Computing (CLUS- 
TER), pp. 319-324. IEEE, Piscataway (2018) 

2. Soysal, M., Berghoff, M., Klusáček, D., Streit, A.: On the quality of wall 
time estimates for resource allocation prediction. In: Proceedings of the 48th 
International Conference on Parallel Processing: Workshops, pp. 1-8. ACM, 
New York (2019) 


DASH 

1. Kowalewski, R., Jungblut, P., Fiirlinger, K.: Engineering a distributed histogram 
sort. In: 2019 IEEE International Conference on Cluster Computing (CLUS- 
TER), pp. 1-11. IEEE, Piscataway (2019) 


Following the DFG requirements for final reports in priority programs. 
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. Fürlinger, K., Glass, C., Gracia, J., Knüpfer, A., Tao, J., HHtinichnich, D., 


Idrees, K., Maiterth, M., Mhedheb, Y., Zhou, H.: DASH: data structures and 
algorithms with support for hierarchical locality. In: European Conference on 
Parallel Processing, pp. 542-552. Springer, Berlin (2014) 


ESSEX 


1. 


Pieper, A., Kreutzer, M., Alvermann, A., Galgon, M., Fehske, H., Hager, G., 
Lang, B., Wellein, G.: High-performance implementation of Chebyshev filter 
diagonalization for interior eigenvalue computations. J. Comput. Phys. 325, 226- 
243 (2016) 


. Rohrig-Z6llner, M., Thies, J., Kreutzer, M., Alvermann, A., Pieper, A., Baser- 


mann, A., Hager, G., Wellein, G., Fehske, H.: Increasing the performance of the 
Jacobi—Davidson method by blocking. SIAM J. Sci. Comput. 37(6), C697—C722 
(2015) 


ExaDG 


1. 


2. 


Kronbichler, M., Kormann, K.: Fast matrix-free evaluation of discontinuous 
Galerkin finite element operators. ACM Trans. Math. Softw. 45(3), 1-40 (2019) 
Fehn, N., Wall, W.A., Kronbichler, M.: Efficiency of high-performance dis- 
continuous Galerkin spectral element methods for under-resolved turbulent 
incompressible flows. Int. J. Numer. Methods Fluids 88(1), 32-54 (2018) 


EXA-Dune 


1. 


Bastian, P., Engwer, C., Gdddeke, D., Iliev, O., Ippisch, O., Ohlberger, M., Turek, 
S., Fahlke, J., Kaulmann, S., Steffen Müthing, S., et al.: EXA-DUNE: flexible 
PDE solvers, numerical methods and applications. In: European Conference on 
Parallel Processing, pp. 530-541. Springer, Berlin (2014) 


. Engwer, C., Altenbernd, M., Dreier, N.A., Göddeke, D.: A high-level C++ 


approach to manage local errors, asynchrony and faults in an MPI application. 
In: 2018 26th Euromicro International Conference on Parallel, Distributed and 
Network-based Processing (PDP), pp. 714-721. IEEE, Piscataway (2018) 


ExaFSA 


1. 


Mehl, M., Uekermann, B., Bijl, H., Blom, D., Gatzhammer, B., Van Zuijlen, 
A.: Parallel coupling numerics for partitioned fluid—structure interaction simula- 
tions. Comput. Math. Appl. 71(4), 869-891 (2016) 


. Totounferoush, A., Pour, N.E., Schröder, J., Roller, S., Mehl, M.: A new load 


balancing approach for coupled multi-physics simulations. In: 2019 IEEE Inter- 
national Parallel and Distributed Processing Symposium Workshops (IPDPSW), 
pp. 676—682. IEEE, Piscataway (2019) 


EXAHD 


1. 


Obersteiner, M., Hinojosa, A.P., Heene, M., Bungartz, H.J., Pflüger, D.: A 
highly scalable, algorithm-based fault-tolerant solver for gyrokinetic plasma 
simulations. In: Proceedings of the 8th Workshop on Latest Advances in Scalable 
Algorithms for Large-Scale Systems, pp. 1—8 (2017) 
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2. Hupp, P., Heene, M., Jacob, R., Pfliiger, D.: Global communication schemes for 


the numerical solution of high-dimensional PDEs. Parallel Comput. 52, 78-105 
(2016) 


ExaSolvers 


1. 


Benedusi, P., Garoni, C., Krause, R., Li, X., Serra-Capizzano, S.: Space-time 
FE-DG Discretization of the anisotropic diffusion equation in any dimension: 
the spectral symbol. SIAM J. Matrix Anal. Appl. 39(3), 1383-1420 (2018) 


. Kreienbuehl, A., Benedusi, P., Ruprecht, D., Krause, R.: Time-parallel gravita- 


tional collapse simulation. Commun. Appl. Math. Comput. Sci. 12(1), 109-128 
(2015) 


ExaStencils 


1. 


K6stler, H., Schmitt, C., Kuckuk, S., Kronawitter, S., Hannig, F., Teich, J., Rüde, 
U., Lengauer, C.: A scala prototype to generate multigrid solver implementations 
for different problems and target multi-core platforms. Int. J. Comput. Sci. Eng. 
14(2), 150-163 (2017). https://doi.org/10.1504/IJCSE.2017.082879 


. Schmitt, C., Kronawitter, S., Hannig, F., Teich, J., Lengauer, C.: Automating the 


development of high-performance multigrid solvers. Proc. IEEE 106(11), 1969- 
1984 (2018) 


ExtraPeak 


1. 


Shudler, S., Calotoiu, A., Hoefler, T., Wolf, F.: Isoefficiency in practice: config- 
uring and understanding the performance of task-based applications. In: ACM 
SIGPLAN Notices, vol. 52, pp. 131-143. ACM, New York (2017) 


. Calotoiu, A., Hoefler, T., Poke, M., Wolf, F.: Using automated performance 


modeling to find scalability bugs in complex codes. In: Proceedings of the 
International Conference on High Performance Computing, Networking, Storage 
and Analysis, p. 45. IEEE, Piscataway (2013) 


FFMK 


1. 


Weinhold, C., Lackorzynski, A., Hartig, H.: FFMK: an HPC OS based on 
the L4Re microkernel. In: Operating Systems for Supercomputers and High 
Performance Computing, pp. 335-357. Springer, Berlin (2019) 


. Gholami, M., Schintke, F.: Multilevel checkpoint/restart for large computational 


jobs on distributed computing resources. In: IEEE 38th Symposium on Reliable 
Distributed System (SRDS) (2019) 


GROMEX 


1. 


Beckmann, A., Kabadshow, I.: Portable node-level performance optimization for 
the fast multipole method. In: Recent Trends in Computational Engineering- 
CE2014, pp. 29—46. Springer, Berlin (2015) 


. Kutzner, C., Pall, S., Fechner, M., Esztermann, A., de Groot, B.L., Grubmiiller, 


H.: More bang for your buck: Improved use of GPU nodes for GROMACS 2018. 
J. comput. chem. 40(27), 2418-2431 (2019) 
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MYX 

1. Protze, J., Tsuji, M., Terboven, C., Dufaud, T., Murai, H., Petiton, S., Emad, N., 
Müller, M., Boku, T.: Myx—runtime correctness analysis for multi-level parallel 
programming paradigms. In: Software for Exascale Computing: SPPEXA 2016- 
2019. Lecture Notes in Computational Science and Engineering. Springer, Berlin 
(2020) 

2. Protze, J., Schulz, M., Ahn, D.H., Miiller, M.S.: Thread-local concurrency: a 
technique to handle data race detection at programming model abstraction. In: 
Proceedings of the 27th International Symposium on High-Performance Parallel 
and Distributed Computing, pp. 144-155 (2018) 


Terra-Neo 

1. Bauer, S., Huber, M., Ghelichkhan, S., Mohr, M., Rüde, U., Wohlmuth, B.: 
Large-scale simulation of mantle convection based on a new matrix-free 
approach. J. Comput. Sci. 31, 60-76 (2019) 

2. Huber, M., Gmeiner, B., Rüde, U., Wohlmuth, B.: Resilience for massively 
parallel multigrid solvers. SIAM J. Sci. Comput. 38(5), S217—S239 (2016) 
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A Perspective on the SPPEXA A) 
Collaboration from France hap for 


Nahid Emad 


As the French member of the Steering Committee of SPPEXA, it is my great 
pleasure to give a short address to this volume from the perspective of the 
French partners in this German-French-Japanese cooperation. To highlight the 
types of software supported by SPPEXA, we first present a classification of high- 
performance software types. We then take a look at the recent activities of HPC 
software in France under the SPPEXA umbrella. Next, some local impacts of the 
SPPEXA collaboration on the French HPC community is provided, and lastly, an 
outlook to future collaborations. 


1 HPC Software in Three Phases 


High-performance numerical software targets at obtaining relevant scalability in 
space and time for large-size applications by using a large number of cores/pro- 
cessors/nodes of powerful computers. They can be classified into three phases: 
pre-treatment, treatment, and post-treatment. Obviously, such a software often 
belongs to more than one of the categories mentioned. 


Pre-processing Software The precise definition of these phases depends on the 
context, but the main role of pre-treatment software is the preparation of the 
input data for the treatment phase. This preparation sometimes consists of a rather 
complex parallel algorithmic and programming processing. Big data compression, 
uniform data formatting, and conditioning improvement of data matrices are some 
examples of the pre-treatment phase. 


N. Emad (24) 

University of Paris Saclay, UVSQ, LI PARAD, Maison de la Simulation, 
Versailles, France 

e-mail: nahid.emad @uvsq. fr 


© The Author(s) 2020 19 
H.-J. Bungartz et al. (eds.), Software for Exascale Computing - SPPEXA 

2016-2019, Lecture Notes in Computational Science and Engineering 136, 
https://doi.org/10.1007/978-3-030-47956-5_2 


20 N. Emad 


Treatment Software The HPC software in the processing phase concerns mainly 
high-performance numerical simulation of physical phenomena, social networks, 
etc. These softwares could be classified as (1) “standard” libraries and (2) ad-hoc 
libraries done by application field programmers, which implement their application 
by making use of building blocks (partially or totally) of the libraries of class (1). 
The latter implement numerical methods with the main parallel/distributed program- 
ming methodologies, such as ScaLAPACK, PETSc, SLEPc, ATLAS, etc. In the 
fields of application , HPC ad-hoc software targets epidemiology, electromagnetism, 
gamma astronomy, safety, or health and nutrition. Some ad-hoc software examples 
are CEDRE which targets simulation for energy and propulsion, CELESTIA and 
STELLARIUM which is a space simulator and a planetarium for observing the solar 
system and the rest of the universe in real time and in 3D, GAUSS, which is a flexible 
platform for data analysis, and AREPO, which is a cosmological hydrodynamical 
simulation code on a dynamic unstructured mesh. 


Post-processing Software Post-processing software essentially involves the anal- 
ysis, visualization, and performance evaluation of the treatment phase results. 
Some examples of such software are ParaView (a multi-platform data analysis and 
visualization application), VisIt (an interactive platform for visualization, animation 
and data analysis), MAQAO (sets of software tools for code optimization in the 
core or node level of a parallel architecture), and Maya (a software for modeling, 
simulation, and 3D animation). 

All these software packages generally translate a physical phenomenon, social 
behaviour, etc. into mathematical equations. Their high-performance implementa- 
tion on parallel and/or distributed systems is a delicate task and requires a huge 
ecosystem with people having interdisciplinary skills. This makes the existence 
and use of accompanying software necessary, which provides the logistics of high- 
performance computing. These software frameworks provide the environment for 
high-performance programming and often conceal the complexity of underlying 
parallel and/or distributed architectures. As a consequence this allows the users to 
focus on main objectives. MPI, OpenMP, Globus, Condor, XMP, YML, MUST, etc. 
are very few examples of this kind of software. 


2 Trilateral Projects in SPPEXA and Their Impact 


SPPEXA targeted fundamental research on different aspects of HPC software and 
covered software categories cited before with a co-design approach. Thanks to 
ANR, DFG, and JST, the trilateral French/German/Japanese projects have been 
funded within SPPEXA. Some of these projects, such as MYX, have benefited 
from pre-existing bilateral collaborations. This allowed a dynamical and productive 
work from the beginning and for a rapid progress towards the objectives set. In 
addition to project meetings, the cross-project SPPEXA workshops have given a 
new dynamic to the trilateral collaborations paving the way for the organization of 
other conferences, workshops or seminars. 
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The EXAMAG (Exascale Simulations of the Magnetic Universe) project is an 
example of an SPPEXA trilateral French/German/Japanese project with the aim of 
improving the astrophysical moving-mesh code AREPO and extending its range 
of applicability for high scale computing platforms. EXAMAG is an ad-hoc HPC 
software of the processing category of the classification given in the previous 
section. 

MYX (MUST Correctness Checking for YML and XMP Programs) also is a 
trilateral French/German/Japanese project which aims to offer a guideline how to 
limit the risk to introduce errors and how to best express the parallelism to catch 
errors at runtime. From a practical viewpoint, MYX aims at the design and the 
application of a scalable correctness checking tool MUST to YML and XMP. In 
the MYX project, the main developed software packages (YML, XMP and MUST) 
belong to the last category of HPC software; the ones providing the logistics of high- 
performance application programs. However, in order to validate the design and 
development of these softwares, many other benchmarks and/or real applications 
are developed. Among them are the multiple restarted Krylov methods/HPCS ad- 
hoc, matrix generator/pretreatment , epidemic HPCS ad-hoc, etc. 

The SPPEXA funding of workshops with several projects involved added an 
extra dimension of interdisciplinarity. In collaboration with the DASH and ESSEX- 
II projects, MYX members organized four trilateral (German/Japanese/French) 
workshops. Two of them have been hold at university of Paris Saclay/Versailles 
in France. With the prominent invited speakers and the talks of SPPEXA-involved 
project members, these workshops have been very attractive (40 and 60 attendees, 
respectively). An important number of indirect outcomes of SPPEXA activities 
(workshops, “open” trilateral meetings, doctoral retreats, etc.) generated new 
connections between German and French colleagues and students. A few examples 
are > the review of the PhD dissertation of a non-SPPEXA funded French student by 
Sabine Roller, professor at Siegen University, Germany, > Xinzhe Wu, who finished 
recently his PhD, funded by ANR part of ANR/DFG/JST MYX project and, who 
is currently in a post-doctoral position at Jiilich Research Centre in Germany, > 
M.A. Diop, currently PhD student, funded by a French CIFRE followship (with 
ATOS/EVIDIAN company), who participated in SPPEXA doctoral retreats as well 
as several SPPEXA workshops, > a workhop organised by Sabine Roller and Nahid 
Emad at HPC Asia, or a large number of BSc and MSc students benefiting from the 
collaboration. 


3 What Will Be Next? 


The on-going convergence between machine learning, data analysis, and high- 
performance computing is creating new algorithmic and co-design approaches that 
need to be taken into account for the future. With the three tri-national workshops 
in Tokyo, SPPEXA has contributed to this on-going development, and we are all 
looking forward to a continued collaboration in the future. 
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A national research project was running in Japan from 2010 to 2017 named “Devel- 
opment of System Software Technologies for Post-Peta Scheme High Performance 
Computing” (so called Post-Peta CREST) . It was supported by JST (Japan Science 
and Technology Agency), which is the Japanese counterpart to DFG (“Deutsche 
Forschungsgemeinschaft”). The Post-Peta CREST project was similar to the first 
funding phase of SPPEXA in the sense that it had a primarily national scope. 
Then, the Post-Peta CREST project opened up to international collaboration, and 
some projects were extended for two more years, where they formed collaborative 
research groups with SPPEXA phase-II projects. Projects with contributions from 
Japan are ExaFSA, ExaStencils, EXAMAG, ESSEX-II, EXASOLVERS, AIMES, and 
MYX, with more than 10 researchers in the second phase of SPPEXA. To highlight 
the success of the Japanese collaboration with SPPEXA, we have a brief look at two 
working groups. 


ppOpen-HPC and ESSEX-II 


As a part of Post-Peta CREST projects between 2011 and 2015, a group at the 
University of Tokyo developed ppOpen-HPC, which is an open source infrastructure 
for the development and execution of optimized and reliable simulation code on 
post-peta-scale (pp) parallel computers based on many-core architectures. The 
framework covers various types of procedures for scientific computations in various 
types of computational models, such as FEM, FDM, FVM, BEM, and DEM. 
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Automatic tuning (AT) technology enables automatic generation of optimized 
libraries and applications under various types of environments. The most updated 
version of ppOpen-HPC was released as open source software, which is available at 
https://github.com/Post-Peta-Crest/ppOpenHPC. 

In 2016, the team of ppOpen-HPC joined the SPPEXA phase-II project ESSEX- 
II including members from the University of Erlangen-Nuremberg, which is funded 
by JST-CREST and SPPEXA under Japan (JST)-Germany (DFG) collaboration 
until 2018. ESSEX-II developed pK-Open-HPC (extended version of ppOpen-HPC, 
a framework for exa-feasible applications), such as preconditioned iterative solvers 
for quantum science. 

Sparse coefficient matrices derived from applications in quantum science have 
generally relatively very small diagonal components, and they are generally ill- 
conditioned. Therefore, it is difficult to apply preconditioned iterative methods 
developed in ppOpen-HPC directly to such applications. The ESSEX-II team 
developed a regularization method for robustness based on blocking and diagonal 
shifting, which provide efficient and robust convergence of ill-conditioned problems 
in quantum science. Preconditioning methods with the regularization method are 
implemented in GHOST/PHIST libraries for solving matrices, which integrates 
all linear solvers and related methods developed in ESSEX/ESSEX-II projects. 
Moreover, they proposed a new method for global parallel reordering, which 
provides robust and efficient convergence of parallel iterative solvers with ILU- 
based preconditioning for very ill-conditioned problems. The developed method 
kept iteration number constant in strong scaling cases up to O(104) MPI processes 
for very ill-conditioned problems. This is the first method for global parallel 
reordering. 

In the ESSEX-II project, CRAFT (A library for application-level Check- 
point/Restart and Automatic Fault Tolerance) has been developed for fault resilience 
on exascale systems by checkpointing. ESSEX-II integrated the dynamic load 
balancing function and CRAFT, and developed a prototype of a fault-resilient 
framework for parallel FEM applications. Parallel FEM codes can continue 
computations by this framework, when some of the computing nodes fail. This 
framework does not need spare nodes for fault resilience. This idea can be extended 
to various types of procedures for dynamic scheduling on exascale systems. 

Collaborations in ESSEX-II project have been continuing in the JHPCN projects 
(“Numerical Library with High-Performance/Adaptive-Precision/High-Reliability” 
(starting in 2018), “Innovative Multigrid Methods” (starting in 2018)), and in 
“Innovative Methods for Scientific Computing in the Exascale Era by Integrations 
of (Simulation+Data+Learning)” funded by “Grant-in-Aid for Scientific Research 
(S) (KAKENHI S)” (2019-2023) 
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Xevolver and ExaFSA 


The so-called Xevolver project is one of the Post-Peta CREST projects from 2011 
to 2017. A group at the Tohoku University discussed how they could help in legacy 
code migration to future-generation extreme-scale computing systems that will be 
massively parallel and heterogeneous. Even today an HPC application code is likely 
optimized assuming a particular system configuration, and hence specialized only 
for its target system. In general, such an application is not performance-portable 
at all. As the HPC system architectures are now diverging and also getting more 
complicated in terms of accelerators, it will require more time and effort to migrate 
or re-optimize the code to another system in the future. To make matters worse, 
system-specific code optimizations are tightly interwoven with the computation 
and thereby degrade the code readability and maintainability, even though HPC 
applications need to evolve not only for achieving high performance, but also for 
advancing computational science. Therefore, in the project, our team has developed 
a code transformation framework, Xevolver, so that users can define their own 
code transformations and thus express system-specific code optimizations as code 
transformation rules. Since code transformation rules can be defined separately from 
application codes themselves, the Xevolver framework can contribute to separation 
of system-specific performance concerns from application codes, and hence prevent 
overcomplicating the codes. 

In 2016, core members of the Xevolver research team joined the second phase of 
the ExaFSA project in order to demonstrate that the Xevolver approach is effective 
for optimizing real-world applications in practice. The Xevolver approach assumes 
that an HPC application is developed by a team work of at least two kinds of 
programmers. One is application developers and the other is performance engineers. 
Application developers are interested in simulation results rather than performance, 
while performance engineers are mainly focusing on sustained simulation perfor- 
mance. Therefore, Japanese researchers have worked as performance engineers 
using Xevolver by considering German research groups as application developers. 

The ExaFSA project focused on engineering two solvers, FASTEST and Ateles, 
which have been developed in the ExaFSA project as primary building blocks of a 
practical coupled simulation. An incompressible flow solver, FASTEST, has a long 
history of development and was once optimized for classic vector machines. Thus, 
some of important kernels still have two versions, default version and its vector- 
optimized version. In the ExaFSA project, hence, they used the Xevolver framework 
to express the differences between the two versions, and demonstrated that the 
vector-optimized version can be generated by transforming the default version. 
That is, the Xevolver approach can express the system-specific code optimizations 
as code transformation rules, and thus even simplify the code while achieving 
high performance and portability. Ateles is based on based on Discontinuous 
Galerkin (DG) discretization method, and a part of the simulation framework, 
APES, was developed at the University of Siegen in Germany. Unlike FASTEST, 
Ateles is written using modern Fortran language features to hide the implementation 
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details. However, the kernel loops still need to be optimized in different ways for 
individual system architectures to achieve high performance. For example, some 
loop optimizations with compiler directives are mandatory for the NEC SX-ACE 
vector computing system to properly vectorize and thus efficiently execute the 
loops. In this project, Xevolver is used to apply the loop optimizations without 
major modifications of the original code. Accordingly, the ExaFSA project was a 
very good opportunity for us to demonstrate that the Xevolver approach can help 
an appropriate division of labor between application developers and performance 
engineers by achieving separation of concerns. This clarification of role-sharing will 
be very helpful for long-term application development especially in an upcoming 
extreme-scale computing era. 


The Role of Japan in HPC Collaborations 


The SPPEXA program was unique in the sense that it established sustainable 
connections in the field of HPC between France, Germany, and Japan. With the 
supercomputing infrastructure in Japan (and its upcoming flagship supercomputer 
Fugaku), the three countries are suitable partners for portability and methodology 
comparisons and, thus, synergistic research developments (such as within SPPEXA 
connection). 

A new field of interest in Japan, Germany, and France is data science and its 
connection to HPC. SPPEXA participated in the tri-lateral workshop in Tokyo 
“Convergence of HPC and Data Science for Future Extreme Scale Intelligent 
Applications”, where we discussed new possible collaborations in the fields of HPC 
and Big Data. Looking back at SPPEXA, we see many success stories, and hope for 
a lot of continuing collaborations. 
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Sebastian Oeste, Marc-André Vef, Mehmet Soysal, Wolfgang E. Nagel, 
André Brinkmann, and Achim Streit 


Abstract Today’s High-Performance Computing (HPC) environments increasingly 
have to manage relatively new access patterns (e.g., large numbers of metadata 
operations) which general-purpose parallel file systems (PFS) were not optimized 
for. Burst-buffer file systems aim to solve that challenge by spanning an ad hoc file 
system across node-local flash storage at compute nodes to relief the PFS from such 
access patterns. However, existing burst-buffer file systems still support many of the 
traditional file system features, which are often not required in HPC applications, at 
the cost of file system performance. 

The ADA-FS project aims to solve that challenge by providing a temporary burst- 
buffer file system—GekkoFS—which relaxes POSIX, based on previous usage 
studies of how HPC applications use file systems. Due to a highly distributed and 
decentralized design GekkoFS reaches scalable data and metadata performance with 
tens of millions of metadata operations per second on a 512 node cluster. The 
ADA-FS project further investigated the benefits of using ad hoc file systems and 
how they can be integrated into the workflow of supercomputing environments. In 
addition, we explored how to gather application-specific information to optimize the 
file system for an individual application. 
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1 Introduction 


Application-imposed workloads on High-Performance computing (HPC) environ- 
ments have considerably changed in the past decade. While traditional HPC applica- 
tions have been compute-bound, large-scale simulations, today’s HPC applications 
are also generating, processing, and analyzing massive amounts of experimen- 
tal data—known as data-driven science applications—affecting several scientific 
fields. Some of which have already made significant progress in previously unad- 
dressable challenges due to newly discovered techniques [27, 55]. 

Many data-driven workloads are based on new algorithms and data structures 
which impose new requirements on HPC file systems [45, 77]. Particularly, large 
numbers of metadata operations, data synchronization, non-contiguous and random 
access patterns, and small I/O requests [14, 45], used in data-driven science 
applications, are challenging for today’s general-parallel file systems (PFSs) to 
handle since past workloads mostly perform sequential I/O operations on large 
files. Not only are such applications disruptive to the shared storage system but 
also heavily interfere with other applications which access the same shared storage 
system [18, 68]. As a result, many workloads which impose these new types of I/O 
operations suffer from prolonged I/O latencies, reduced file system performance, 
and occasional long wait times. 

Software-based approaches, e.g., application modifications or middleware and 
high-level libraries [21, 39], and hardware-based approaches, moving from mag- 
netic disks to NAND-based solid-state drives (SSDs) within PFSs, are attempts to 
mitigate the impact of these new access patterns on the HPC system. However, 
software-based approaches often suffer from time-consuming adaptations within 
applications and are sometimes (based on the underlying algorithms) even impos- 
sible to adapt to. One of the hardware-based approaches leverages on, nowadays, 
existing SSDs, installed within a compute node, in order to use them as node- 
local burst buffers. To achieve high metadata performance, they can be deployed 
in combination with a dynamic burst buffer file system [5,78]. Nonetheless, existing 
burst buffer file systems have been mostly POSIX compliant which can severely 
reduce a file system’s peak performance [75]. 

The ADA-FS project, funded by the German Research Foundation (DFG) 
through the Priority Programme 1648 “Software for Exascale Computing”, aims 
to further explore the possibilities of burst buffer file systems in this context while 
investigating how they can be used in a modern HPC system. The developed burst 
buffer file system—GekkoFS—acts as ADA-FS’ main component. GekkoFS is a 
temporarily deployed, highly-scalable distributed file system for HPC applications 
which aims to accelerate I/O operations of common HPC workloads that are 
challenging for modern PFSs. As such, it can be used in several temporary use cases, 
such as the lifetime of a compute job or in longer-term use cases, e.g., campaigns. 
Unlike previous works on burst buffer file systems, it relaxes POSIX by removing 
some of the semantics that most impair I/O performance in a distributed context 
and takes previous studies on the behavior of HPC applications into account [37] to 
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optimize the most used file system operations. As a result, GekkoFS reaches scalable 
data and metadata performance with tens of millions of metadata operations per 
second on a 512 node cluster while still providing strong consistency for file system 
operations that target a specific file or directory. In fact, due to its highly distributed 
and decentralized file system design, GekkoFS is built to perform on even bigger 
supercomputers, as exascale environments are right around the corner. 

While GekkoFS provides the core building block within ADA-FS, it relies and 
benefits from further information of the application it is used with. Application- 
specific information that we gather can then further optimize the file system (e.g., 
the used file system block size) and therefore may increase the file system’s 
performance in terms of latency and throughput. In addition, the ADA-FS project 
investigated how such a temporary and on demand burst buffer file system can be 
integrated into the workflow of batch systems in supercomputing environments. 
Although it is hard to reliably predict when compute jobs finish to prematurely 
deploy GekkoFS for a following ADA-FS job, for instance, we investigated and 
showed the benefits of on demand burst buffer file systems concerning both 
application performance and the reduction of the PFS load as a result of using such 
a file system. 

The article is structured as follows: first, we describe GekkoFS’ design and its 
evaluation of nowadays common and challenging HPC workloads on a 512 node 
cluster in Sect. 2. Section 3 discusses the existing challenges when data is staged 
in advance and how we solved the challenge, through implementing a plugin for 
the batch system. Section 4 discusses how we can detect system resources like the 
amount of node local storage or the NUMA configuration of a node which can be 
used for the deployment of the GekkoFS file system even on heterogenous compute 
nodes. In Sect. 5 we show how the option for an on demand file system, can be added 
to an HPC system. We follow with an evaluation of the performance of GekkoFS 
for new NVME based storage systems in Sect. 6. Finally, we conclude in Sect. 7. 


2 GekkoFS—A Temporary Burst Buffer File System 
for HPC 


In this section, we present the main component of ADA-FS—GekkoFS. GekkoFS 
is a temporarily deployed, highly-scalable burst buffer file system for HPC appli- 
cations. In general, the goal of GekkoFS is to accelerate I/O operations in common 
HPC workloads that are challenging for modern PFSs while offering the combined 
storage capabilities of node-local storage devices. Further, it does not only aim for 
providing scalable I/O performance, but, in particular, focuses on offering scalable 
metadata performance by departing from traditional ways of handling metadata in 
distributed file systems. To provide a single, global namespace, accessible to all file 
system nodes, the file system pools together fast node-local storage resources of all 
participating file system nodes. 
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Based on previous studies [37] on the behavior of HPC applications, GekkoFS 
relaxes or removes some of the POSIX semantics, known to heavily impact I/O 
performance in a distributed environment. As a result, it is able to optimize for the 
most used file system operations, achieving tens of millions of metadata operations 
per second on a 512 node cluster. At the same time, GekkoFS is able to run complex 
applications, such as OpenFOAM solvers [32], and since the file system runs in user- 
space and it can be easily deployed in under 20s on a 512 node cluster, it is usable 
by any user. Consequently, GekkoFS can be used for several use cases which require 
an ephemeral distributed file system, such as during the lifetime of a compute job or 
campaigns where data is simultaneously accessed by many nodes in short bursts. 

Parts of this section’s contents is build on the conference paper by the authors M.- 
A. Vef et al. [72] and the journal article by the authors M.-A. Vef et al. [71] which 
both discuss each of the system components of GekkoFS in more detail and provide 
an in-depth investigation into the performance of GekkoFS compared to other file 
systems in various HPC environments. First, Sect.2.1 provides a background on 
parallel and distributed file systems and discusses some of the related work in the 
context of burst buffer file systems. Section 2.2 presents the file system’s core 
architecture and design to achieve scalable data and metadata performance in a 
distributed environment. Finally, in Sect.2.3 we demonstrate GekkoFS data and 
metadata performances. 


2.1 Related Work 


In this section, we give an overview over existing HPC file systems and discuss the 
differences to GekkoFS. 


2.1.1 General-Purpose Parallel File Systems 


Most HPC systems are equipped with a backend storage system which is globally 
accessible using a parallel file system (e.g., GPFS [57], Lustre [7,53], BeeGFS [26], 
or PVFS [56]). These file systems offer a POSIX-like interface and focus on data 
consistency and long-term storage. However, due to the nature of the file system 
being globally accessible, single applications can disrupt the I/O performance of 
other applications as well. In addition, these file systems are not well suited for small 
file accesses, in particular on shared files, often found in scientific applications [45]. 

The design of GekkoFS does not focus on long-term storage and aims for 
temporary use cases, such as in the context of compute jobs or campaigns. In 
addition, since GekkoFS relaxes POSIX semantics, it is able to provide a significant 
increase in metadata performance. 
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2.1.2 Node-Local Burst Buffers 


Burst buffers are fast, intermediate storage systems that aim to reduce the load on 
the global file system and on reducing an applications’ I/O overhead [38]. Such 
burst buffers can be categorized into two groups [78]: remote-shared and node- 
local. Remote-shared burst buffers are generally dedicated I/O nodes to forward 
application I/O to the underlying PFS, e.g., DDN’s IME! and Cray’s DataWarp.” 

Node-local burst buffers, on the other hand, are collocated with compute nodes, 
using existing node-local storage. This node-local storage is then used to create a 
(distributed) file system which spans over a number of nodes for the lifetime of a 
compute job, for example. Node-local burst buffers can also be dependent on the 
PFS (e.g., PLFS [5]) or are sometimes even managed directly by the PFS [49]. 

BurstFS [78], perhaps the most related work to ours, is a standalone burst buffer 
file system which does not require a centralized instance as well. However, GekkoFS 
is not limited to writing data locally like BurstFS. Instead, all data is distributed 
across all participating file system nodes to balance data workloads for write and 
read operations without sacrificing scalability. BeeOND [26] can create a job- 
temporal file system on a number of nodes similar to GekkoFS. BeeOND is, in 
contrast to our file system, POSIX compliant and our GekkoFS measurements show 
a much higher metadata throughput than offered by BeeOND [69, 71]. 


2.1.3 Metadata Scalability 


The management of inodes (containing a file’s metadata) and related directory 
blocks (containing data about which files belong to the directory) are the main 
scalability limitations of file systems in a distributed environment [73]. Typically, 
general-purpose PFSs distribute data across all available storage targets. While 
this technique works well for data, it does not achieve the same throughput when 
handling metadata [11, 54], although the file system community presented various 
techniques to tackle this challenge [5, 22, 50,51, 79, 80]. The performance limitation 
can be attributed to the sequentialization enforced by underlying POSIX semantics 
which is particularly degrading throughput when an extremely large number of files 
is created in a single directory from multiple processes. This workload, common 
to HPC environments [5, 49, 50, 74], can become an even bigger challenge for 
upcoming data-science applications. GekkoFS handles directories and replaces 
directory entries by objects, stored within a strongly consistent key-value store 
which helps to achieve tens of millions of metadata operations for billions of files. 


'TME: https://www.ddn.com/products/ime-flash-native-data-cache/. 
Datawarp: https://www.cray.com/datawarp. 
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2.2 Design 


In this section, we present goals, architecture, and general design of GekkoFS 
which allows scalable data and metadata performance. In general, any user without 
administrative access should be able to deploy GekkoFS. The user dictates on 
how many compute nodes and at which path the mountpoint of GekkoFS and 
its metadata and data is stored. The user is then presented with a single global 
namespace, consisting of the aggregated node-local storage of each node. To provide 
this functionality GekkoFS aims to achieve four core goals: 


Scalability: | GekkoFS should be able to scale with an arbitrary number of nodes 
and efficiently use available hardware. 

Consistency model: GekkoFS should provide the same strong consistency as 
POSIX for common file system operations that access a specific data file. 
However, the consistency of directory operations, for example, can be relaxed. 

Fast deployment: To avoid wasting valuable and expensive resources in HPC 
environments, the file system should startup within a minute and be ready for 
usage immediately by applications after the startup succeeds. 

Hardware independence: GekkoFS should be able to support networking hard- 
ware that is commonly used in HPC environments, e.g., Omni-Path or Infiniband. 
The file system should be able to use the native networking protocols to 
efficiently move data between file system nodes. Finally, GekkoFS should work 
with modern and future storage technologies that are accessible to a user at an 
existing file system path. 


2.2.1 POSIX Semantics 


Similarly to PVFS [12] and OrangeFS [42], GekkoFS does not provide complex 
global locking mechanisms. In this sense, applications should be responsible to 
ensure that no conflicts occur, in particular, concerning overlapping file regions. 
However, the lack of distributed locking has consequences for operations where the 
number of affected file system objects is unknown beforehand, e.g., readdir () 
called by the 1s -1 command. In these indirect file system operations, GekkoFS 
does not guarantee to return the current state of the directory and follows the 
eventual-consistency model. Furthermore, each file system operation is synchronous 
without any form of caching to reduce file system complexity and to allow for an 
evaluation of its raw performance capabilities. 

Further, GekkoFS does not support move or rename operations or linking 
functionality as HPC application studies have shown that these features are rarely 
or not used at all during the execution of a parallel job [37]. Such unsupported file 
system operations then trigger an I/O error to notify an application. Finally, security 
management in the form of access permissions is not maintained by GekkoFS since 
it already implicitly follows the security protocols of the node-local file system. 
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Fig. 1 GekkoFS architecture 


2.2.2 Architecture 


The architecture of GekkoFS (see Fig. 1) consists of two main components: a client 
library and a server process. An application that uses GekkoFS must first preload the 
client interposition library which intercepts all file system operations and forwards 
them to a server (GekkoFS daemon), if necessary. The GekkoFS daemon, which runs 
on each file system node, receives forwarded file system operations from clients and 
processes them independently, sending a response when finished. In the following 
paragraphs, we describe the client and daemon in more detail. 


2.2.3 GekkoFS Client 


The client consists of three components: (1) An interception interface that catches 
relevant calls to GekkoFS and forwards unrelated calls to the node-local file system; 
(2) a file map that manages the file descriptors of open files and directories, 
independently of the kernel; and (3) an RPC-based communication layer that 
forwards file system requests to local/remote GekkoFS daemons. 

Each file system operation is forwarded via an RPC message to a specific daemon 
(determined by hashing of the file’s path, similar to Lustre DNE 2° ) where it is 
directly executed. In other words, GekkoFS uses a pseudo-random distribution to 
spread data and metadata across all nodes, also known as wide-striping. Because 
each client is able to independently resolve the responsible node for a file system 
operation, GekkoFS does not require central data structures that keep track of where 
metadata or data is located. To achieve a balanced data distribution for large files, 


3https://lustre.ornl.gov/ecosystem-2016/documents/papers/LustreEco2016-Simmons-DNE.pdf. 
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data requests are split into equally sized chunks before they are distributed across 
file system nodes (or GekkoFS daemons). The GekkoFS daemons then store each 
received chunk in a separate file (so-called chunk files) in its underlying node- 
local storage. If supported by the underlying network fabric protocol, the client 
exposes the relevant chunk memory region to the daemon, accessed via remote- 
direct-memory-access (RDMA). 


2.2.4 GekkoFS Daemon 


GekkoFS daemons consist of three parts: (1) A key-value store (KV store) used 
for storing metadata; (2) an I/O persistence layer that reads/writes data from/to the 
underlying local storage system; and (3) an RPC-based communication layer that 
accepts local and remote connections to handle file system operations. 

Each daemon operates a single local RocksDB KV store [17]. RocksDB is 
optimized for NAND storage technologies with low latencies and fits GekkoFS’ 
needs as SSDs are primarily used as node-local storage in today’s HPC clusters. 
While RocksDB fits this use case well, the component is replaceable by other 
software or hardware solutions. Therefore, GekkoFS may introduce various choices 
for backends in the future to, for example, support recent key-value SSDs* 

For the communication layer, we leverage on the Mercury RPC framework [62]. 
It allows GekkoFS to be network-independent and to efficiently transfer large data 
within the file system. Within GekkoFS, Mercury is interfaced indirectly through 
the Margo library which provides Argobots-aware wrappers to Mercury’s API with 
the goal to provide a simple multi-threaded execution model [13, 58]. Using Margo 
allows GekkoFS daemons to minimize resource consumption of Margo’s progress 
threads and handlers which accept and handle RPC requests [13]. 

Further, as indicated in Sect. 2.1.3, GekkoFS does not use a global locking man- 
ager. Therefore, when multiple processes write to the same file region concurrently, 
they may cause a shared write conflict with resulting undefined behavior with 
regards to which data is written to the underlying node-local storage. Such conflicts 
can, however, be handled locally by any GekkoFS daemon because it is using a 
POSIX-compliant node-local file system to store the corresponding data chunks, 
serializing access to the same chunk file. Note that such conflicts in a single file 
only affect one chunk at a time since the file’s data is spread across many chunk 
files in the file system. As a result, chunks of that file are not disrupted during such 
a potential shared write conflict. 


4https://www.samsung.com/semiconductor/global.semi.static/Samsung_Key_WValue_SSD_enables 
_High_Performance_Scaling-0.pdf. 
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2.3 Evaluation 


In this section, we evaluate the performance of GekkoFS based on various unmod- 
ified microbenchmarks which catch access patterns that are common in HPC 
applications. First, we describe the experimental setup and introduce the workloads 
that we simulate with microbenchmark applications. Then, we investigate the startup 
time of GekkoFS and compare metadata performance against a Lustre parallel 
file system. Although GekkoFS and Lustre have different goals, we point out the 
performances that can be gained by using GekkoFS as a burst buffer file system. 
Finally, we evaluate the data performance of GekkoFS and discuss the measured 
results. 


2.3.1 Experimental Setup 


We evaluated the performance of GekkoFS based on various unmodified 
microbenchmarks which catch access patterns that are common in HPC 
applications. Our experiments were conducted on the MOGON II supercomputer, 
located at the Johannes Gutenberg University Mainz in Germany. All experiments 
were performed on Intel 2630v4 Intel Broadwell processors (two sockets each). 
The node main memory capacity ranges from 64 GiB up to 512 GiB. MOGON II 
uses 100 Gbit/s Intel Omni-Path to establish a fat-tree network between all compute 
nodes. In addition, each node provides a data center Intel SATA SSD DC $3700 
Series as scratch-space (XFS formatted) usable within a compute job. We used these 
SSDs for storing data and metadata of GekkoFS which uses an internal chunk size 
of 512 KiB. All Lustre experiments were performed on a Lustre scratch file system 
with 12 Object Storage Targets (OSTs), 2 Object Storage Servers (OSSs), and 1 
Metadata Service (MDS) with a total of 1.2 PiB of storage. 

Before each experiment iteration, GekkoFS daemons are restarted (requiring less 
than 20s for 512 nodes), all SSD content is removed, and kernel buffer, inode, and 
dentry caches are flushed. The GekkoFS daemon and the application under test are 
pinned to separate processor sockets to ensure that file system and application do 
not interfere with each other. 


2.3.2 Metadata Performance 


We simulated common metadata intensive HPC workloads using the unmodified 
mdtest microbenchmark [41] to evaluate GekkoFS’ metadata performance and 
compare it against a Lustre parallel file system. Although GekkoFS and Lustre have 
different goals, we point out the performances that can be gained by using GekkoFS 
as a burst buffer file system. In our experiments, mdtest performs create, stat, and 
remove operations in parallel in a single directory—an important workload in many 
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Fig. 2 GekkoFS’ file create, stat, and remove throughput for an increasing number of nodes 
compared to a Lustre file system 


HPC applications and among the most difficult workloads for a general-purpose 
PFS [74]. 

Each operation on GekkoFS was performed using 100,000 zero-byte files per 
process (16 processes per node). From the user application’s perspective, all created 
files are stored within a single directory. However, due to GekkoFS’ internally 
kept flat namespace, there is conceptually no difference in which directory files 
are created. This is in contrast to a traditional PFS that may perform better if the 
workload is distributed among many directories instead of in a single directory. 

Figure 2 compares GekkoFS with Lustre in three scenarios with up to 512 
nodes: file creation, file stat, and file removal. The y-axis depicts the corresponding 
operations per second that were achieved for a particular workload on a logarithmic 
scale. Each experiment was run at least five times with each data point representing 
the mean of all iterations. GekkoFS’ workload scaled with 100,000 files per process, 
while Lustre’s workload was fixed to four million files for all experiments. We fixed 
the number of files for Lustre’s metadata experiments because Lustre was otherwise 
detecting hanging nodes when scaling to too many files. 

Lustre experiments were run in two configurations: All processes operated in 
a single directory (single dir) or each process worked in its own directory 
(unique dir). Moreover, Lustre’s metadata performance was evaluated while the 
system was accessible by other applications as well. 

As seen in Fig. 2, GekkoFS outperforms Lustre by a large margin in all scenarios 
and shows close to linear scaling, regardless of whether Lustre processes operated 
in a single or in an isolated directory. Compared to Lustre, GekkoFS achieved 
around 46 million creates/s (~1405 x), 44 million stats/s (~359x), and 22 million 
removes/s (~453x) on 512 nodes. The standard deviation was less than 3.5% 
which was computed as the percentage of the mean. Therefore, we achieve our 
scalability goal, demonstrating the performance benefits of distributing metadata 
and decoupling directory entries from non-scalable directory blocks (see Sect. 2.2). 

Additional GekkoFS experiments were also run while Mogon II was used by 
other users during production, revealing network interference within the cluster. 
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With up to 128 nodes we were unable to measure a difference in metadata operation 
throughput outside of the margin for error compared to the experiments in an 
undisturbed environment (see Fig. 2). For 256 and 512, we measured a reduced 
metadata operation throughput between 10 and 20% for create and stat operations. 
Remove operation throughput remained unaffected. 

Lustre’s metadata performance did not scale beyond approximately 32 nodes, 
demonstrating the aforementioned metadata scalability challenges in such a general- 
purpose PFS. Moreover, processes in Lustre experiments that operated within their 
own directory achieved a higher performance in most cases, except for the remove 
case where Lustre’s unique dir remove throughput is reduced by over 70% at 
512 nodes compared to Lustre’s single dir throughput. This is because the time 
required to remove the directory of each process (in which it creates its workload) is 
included in the remove throughput and the number of created unique directories 
increases with the number of used processes in an experiment. Similarly, the time to 
create the process directories is also included in the create throughput but does not 
show similar behavior to the case of the remove throughput, indicating optimizations 
towards create operations. 


2.3.3 Data Performance 


We used the unmodified JOR [31] microbenchmark to evaluate GekkoFS’ I/O 
performance for sequential and random access patterns in two scenarios: Each 
process is accessing its own file (file-per-process) and all processes access a single 
file (shared file). We used 8 KiB, 64 KiB, | MiB, and 64 MiB transfer sizes to assess 
the performances for many small I/O accesses and for few large I/O requests. We 
ran 16 processes on each client, each process writing and reading 4 GiB in total. 

GekkoFS data performance is not compared with the Lustre scratch file system 
as the peak performance of the used Lustre partition, around 12 GiB/s, is already 
reached for <10 nodes for sequential I/O patterns. Moreover, Lustre has shown to 
scale linearly in larger deployments with more OSSs and OSTs being available [48]. 

Figure 3 shows GekkoFS’ sequential I/O throughput in MiB/s, representing the 
mean of at least five iterations, for an increasing number of nodes for different 
transfer sizes. In addition, each data point is compared to the peak performance 
that all aggregated SSDs could deliver for a given node configuration, visualized 
as a white rectangle, indicating GekkoFS’ SSD usage efficiency. In general, every 
result demonstrates GekkoFS’ close to linear scalability, achieving about 141 GiB/s 
(~80% of the aggregated SSD peak bandwidth) and 204GiB/s (~70% of the 
aggregated SSD peak bandwidth) for write and read operations for a transfer size of 
64 MiB for 512 nodes. 

Figure 4 shows GekkoFS’ throughput for random accesses for an increasing 
number of nodes, showing close to linear scalability in all cases. The file system 
achieved up to 141 GiB/s write throughput and up to 204 GiB/s read throughput for 
64 MiB transfer sizes at 512 nodes. 
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Fig. 3 GekkoFS’ sequential throughput for each process operating on its own file compared to the 
plain SSD peak throughput. (a) Write throughput. (b) Read throughput 


For the file-per-process cases, sequential and random access I/O throughput are 
similar for transfer sizes larger than the file system’s chunk size (512 KiB). This is 
due to transfer sizes larger than the chunk size internally access whole chunk files 
while smaller transfer sizes access one chunk at a random offset. Consequently, 
random accesses for large transfer sizes are conceptually the same as sequential 
accesses. For smaller transfer sizes, e.g., 8 KiB, random write and read throughput 
decreased by approximately 33 and 60%, respectively, for 512 nodes owing to the 
resulting random access to positions within the chunks. 

For the shared file cases, a drawback of GekkoFS’ synchronous and cache- 
less design becomes visible. No more than approximately 150K write operations 
per second were achieved. This was due to network contention on the daemon 
which maintains the shared file’s metadata whose size needs to be constantly 
updated. To overcome this limitation, we added a rudimentary client cache to locally 
buffer size updates of a number of write operations before they are sent to the 
node that manages the file’s metadata. As a result, shared file I/O throughput for 
sequential and random access were similar to file-per-process performances since 
chunk management on the daemon is then conceptually indifferent in both cases. 
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Fig. 4 GekkoFS’ random throughput for each process operating on its own file. (a) Write 
throughput. (b) Read throughput 


3 Scheduling and Deployment 


In order to transfer the data to a previously generated on demand file system in 
time, the nodes that will be allocated to a job must be known in advance. Today’s 
schedulers plan the resources of a supercomputer. The schedule is based on user 
requested wall times. Reality shows that the users requested wall times are very 
inaccurate, and thus the scheduler’s predictions are unreliable. 

Here two investigations were made and published. In the first work, we have 
shown that we can improve wall time estimates based on simple job metadata. We 
also used unconsidered metadata that is usually not publicly available [65]. 

Predicting the run times of jobs is only one aspect of the challenge. However, 
the essential factor is the prediction of node allocation to a job. In this second 
investigation, we have determined the influence of the wall-time on the node 
prediction [64]. The question we wanted to answer—How good do wall time 
predictions have to be to predict the allocated nodes accurately? 
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3.1 Walltime Prediction 


One of the challenges is to know which nodes are going to be allocated to a queued 
job. The HPC scheduler predicts these nodes based on the user given wall times. 
Therefore, we have decided to evaluate whether there is an easy way to predict such 
wall time automatically. Our proposed approach for wall time prediction is to train 
an individual model for every user. For this, we used methods from the machine 
learning domain and added job metadata, which was in previous work unconsidered. 
As historical data, we used workloads from two HPC-systems at the Karlsruhe 
Institute for Technology/Steinbuch Centre for Computing [66], the ForHLR I + 
II [34, 35] clusters. To train the model, we used automatic machine learning 
(AUTOML). AUTOML automates the process of hyperparameter optimization and 
selecting the correct model. We have chosen the auto ML library auto-sklearn [20], 
which is based on scikit-learn [9, 52]. 

In Fig. 5 the comparison of the user given wall times and the wall time prediction 
is shown. As a metric, the median absolute error (medAE) in hours is depicted as 
cumulative distribution. A model trained with AUTOML shows for 60% of the users 
a medAE of approximately 1h on the ForHLR I and 1.4h for the ForHLR II. The 
user estimations show a medAE deviation of about 7.4h on both clusters. So we 
are able to reduce the median absolute deviation from 7.4 down to 1.4h in average. 
Considering the fact that simple methods were used and no insight was provided 
into the job payload, this result is very good. 
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Fig. 5 Comparison of median absolute error (medAE) for ForHLR I+II. X-axis Median absolute 
error in hours, Y-Axis cumulative distribution 
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3.2 Node Prediction 


As mentioned before predicting the run times of jobs is only one aspect of the 
challenge. However, the decisive factor is the accuracy of node allocation prediction. 
In this subsequent investigation, we have determined the impact on the node 
allocation accuracy with improved wall times. Therefore the ALEA Simulator [2] 
has been extended to simulate the time of the node allocation list [64]. 

We have conducted several simulations with subsequently improved job run time 
estimates, from inaccurate wall times as provided by users to fully accurate job run 
time estimates. For this purpose, we introduce Treq, the “refined” requested wall 
time, 


Treq = TRun + A(TReq — TRun) with A € [0, 1], (1) 


where TReq is the user requested wall time and Trun is the run time of the job. 
To effectively simulate different precision of requested wall times, each job in the 
workload is modified by the same 2. 

The result of the simulation is shown in Fig. 6, each bar represents a simulation 
with a different A value. The bars are categorized into four groups based on the valid 
node allocation prediction (TNap). The blue part represents the jobs that are started 
immediately (instant) execution after the job is submitted to the batch system. These 
instantly started jobs offer of course no time to create a file system or even stage 
data. The orange part represents queued jobs with a Tyap between 0 and 1 s. The 
green part shows jobs with a Tyap from one second up to 10 min and red indicates 
long term prediction with a valid node allocation prediction over 10 min. The class 
of jobs with long-term predictions (red) is in our focus. This long-term predictions 
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Fig. 6 Job distributions of ForHLR II workload with back-filling (CONS). Blue color denotes 
instant jobs, orange color means job having prediction <1 s, green color denotes jobs with 1 and 
600s and red color denotes long-term predictions(<600 s) 
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increase significantly only at very small à < 0.1 which proves that very good run 
time estimates are needed. 


3.3. On Demand Burst Buffer Plugin 


From both evaluations, it is clear, that advanced data staging based on the scheduler 
prediction is not possible. Also, by using state-of-the-art methods such as machine 
learning, the accuracy is not sufficient. Therefore we decided to extend the 
functionality of the SLURM [15] scheduler. Slurm has a feature to manage burst 
buffers [16]. However, the current implementation status only includes support 
for the Cray DataWarp solution. Management of burst buffers using other storage 
technologies is documented, but not yet implemented. With the developed plugin, 
we extend the functionality of SLURM to create a file system on demand. For 
the prototype implementation, we also developed tools which deploy BeeOND 
(BeeGFS On Demand) as an on demand file system per job. Other parallel file 
systems,e.g. Lustre [7] or GekkoFS, can be added easily. The user requests an on 
demand file system by a job flag. He can also specify if data should be staged in 
and out. The SLURM controller marks the jobs and then does the corresponding 
operations [76]. 


3.4 Related Work 


The requested wall times are unfortunately far away from the real used wall time. 
Gibbons [23, 24], and Downey [19] used historical workloads to predict the wall 
times of parallel applications. They predict wall times based on templates. These 
templates are created by analyzing previously collected metadata and grouped 
according to similarities. However, both approaches are restricted to simple defi- 
nitions. 

In the recent years, machine learning algorithms have been used to predict 
resource consumption in several studies [33, 40, 43, 44, 61, 70]. 

Predicting the run-time of jobs is also important in different topics, like for 
energy aware scheduling [3]. Here the applications’ power and performance char- 
acteristics are considered to provide an optimized trade off between energy savings 
and job execution time. 

However, all of the above mentioned studies do not try to evaluate the accuracy 
of the node allocation predictions. Most of the publications focus on observing the 
utilization of the HPC system and the reliability of the scheduler estimated job start 
times. In our work, we focus on the node allocation prediction and how good wall 
time estimates have to be. This directly affects, whether a cross-node, on demand, 
independent parallel FS can be deployed, and data can be pre-staged, or not. 
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4 Resource and Topology Detection 


Compute nodes of modern HPC systems tend to get more heterogeneous. To plan 
a proper deployment of the GekkoFS file system on the compute nodes, knowledge 
of the underlying storage components are vital. This section describes what kind 
of resource information is of interest and shows possible ways to gather this 
information. Further, we discuss the architecture of the sysmap tool that we build to 
collect relevant information. 

When thinking about the resources of a compute node, we distinguish between 
static and dynamic resource usage. Static resource information describes compo- 
nents that do not change frequently and are often similar between nodes. This 
includes the number of CPU cores, the amount of main memory, the number and 
capacity of node-local storage devices, or the type of file system. It is unlikely that 
this kind of hardware is replaced frequently. Otherwise different parts of a Cluster 
may have different configurations, e.g., one island of a Cluster may have more 
RAM than another island. On the other hand, dynamic resource usage describes 
the available resources at a certain point in time. 

The goal is to hold a map of the resources available on a system. On the one hand, 
this can be used as an input for the data staging. On the other hand, such information 
is useful for the deployment of the file system. When the job scheduler has decided 
on which set of nodes a job will run, available hardware resources can be queried 
and an appropriate configuration to deploy the file system can be selected. 

The sysmap tool can utilize existing hardware discovery libraries such as 
hwloc [8, 25] by using their interface. While hwloc does an excellent job for 
computing-related artifacts like the number of CPUs or cache sizes, it does not 
focus on the storage subsystem. Therefore, we use information from the /proc 
and /sys pseudo file systems to get information about the system. By reading 
the system configuration, the sysmap tool gathers information about partitions, 
mountpoints, file systems but also about available kernel modules and I/O-scheduler 
configuration. Moreover, we gather network information for InfiniBand networks by 
utilizing the well-known ibnetdiscover tool from the OFED distribution [47]. 


4.1 Design and Implementation 


We have designed an extensible architecture for our sysmap tool. Each 
resource of interest is captured by a so-called extractor. Figure 7 shows a 
schematic UML-diagram of two extractors. Each extractor module consists 
of an abstract part, which defines the structure of the data that will be 
gathered and a specialized part which implements the logic to read the 
data from a specific source, by overriding the abstract interface. In Fig. 7, 
the Filesystem Extractor and the Disk Extractor are examples 
of the abstract parts. The Linux::Filesystem_Extractor and the 
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Fig. 7 Simple UML-Diagram for two example extractor modules of our system-map tool 


AIX::Filesystem_Extractor are the specialized parts for extracting 
information of mountpoints and partitions of a specific system [46]. This is useful 
because the same information may be available on different systems through 
different sources. On the one hand, the user has to define the abstract extractor he 
wants, and the sysmap tool selects the source depending on what is available on the 
target system. On the other hand, we can implement specialized extractor modules 
for different sources resulting in an equivalent representation of the data for our 
tool. After gathering the data, the sysmap tool provides a wide variety of output 
formats presenting the data to the user. Since the tool is mentioned to be executed 
on multiple compute nodes, the recommended way is to store the results in a central 
database. Figure 8 depicts an overview of the general workflow of the resource 
discovery process. The sysmap tool runs on the compute nodes and gathers the 
resource information. Afterwards, the collected data is stored in a central resource 
database. For our working prototype, we use a sqlite? database. The information 
can be queried by the sysquery tool, which queries the resource database and 
outputs the selected data in JSON format. This way the querying component gets 
a machine-readable section of the required data which can be easy post-processed 
for they need. Further, the particular database query remains hidden from the user 
inside the sysquery tool. The datamodel of the resource database is shown in Fig. 9 
and consists of four simple tables. The HostTable and ExtractorTable are to map 
the hostname or the extractor name to a numerical ID. Information extracted by an 
extractor is stored as JSON string in the DataTable. Further, a DataID is maintained 
to reference the data from an extractor. In the Host2Data table, the DatalD is 


Shttps://www.sqlite.org/index.html. 
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a 


| post processing resource information 


| sysquery 
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| 


query resource database 1 


| store topology data | 
Resource Database %-------- | sysmap | 


Fig. 8 Overview of resource discovery components, the blue components are part of the sysmap 
tool suite, the resource database is highlighted as the yellow box, the red box represents the 
querying component, in this case the Job-Scheduler 


DataTable ExtractorTable 


*HostID int "ExtractorID int *ExtractorID int 
*DatalD int "DataID int *Name string 


"Data string 


HostTable 
*HostID int 
* Hostname String 
Fig. 9 The datamodel of the resource database 


mapped with the corresponding HostID. This way, data that is equal across multiple 
nodes do not need to be stored multiple times but are easy to query. Since, the 
output of a query is a JSON string, it makes further processing and output easy for 
the calling script. 


5 On Demand File System in HPC Environment 


When using on demand file systems in HPC environments, the premise is that the 
normal operation should not be affected by the use of on demand file systems. The 
interference on other jobs should be avoided or even reduced. There should be also 
no modifications, that have a negative impact on the performance or utilization, of 
the system. 
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5.1 Deploying on Demand File System 


Usually HPC systems use a batch system, such as SLURM [60], MOAB [1], or 
LSF [30]. The batch system manages the resources of the cluster and starts the user 
jobs on allocated nodes. Before a job is started, a prologue script may be started on 
one or all allocated nodes and, if necessary, an epilogue script at the end of a job. 
These scripts are used to clean, prepare, or test the full functionality of the nodes. We 
modified these scripts to start the on demand file system upon request. During job 
submission, a user can request an on demand file system for the job. This solution 
has minimal impact on the HPC system operation. Users without the need for an on 
demand file system are not affected. An alternative way of deploying a on demand 
file system we have described in Sect. 3.3 


5.2 Benchmarks 


As initial benchmarks we tested the startup and shutdown time of the on demand 
file system (cf. Table 1). Comparing the startup time of BeeGFS on demand to the 
startup time of GekkoFS (512 Nodes under 20s) it is clear, that BeeGFS takes too 
much time for startup and shutdown at larger scales. BeeGFS has a serial section in 
its startup where a status file is created on every node sequentially. This was also 
discussed on the mailing list [63] with a possible solution to improve the behavior 
in future releases. 

In Fig.10 we show the IoZone [10] benchmark to measure the read and 
write throughput of the on demand file system (solid line). The figure shows 
that performance increases linearly with the number of used compute nodes. The 
limiting factor here is the aggregate throughout of the used SATA-SSDs. A small 
throughput variation can be observed due to normal performance scattering of 
SSDs [36]. The dotted line indicates the theoretical throughput with NVMe devices. 
Here we assumed the performance for today’s common PCIe x4 NVM devices [29] 
with a throughput of 3500/2000 MB/s of read/write performance. 

In a further test, we evaluated the storage pooling feature of BeeGFS [4]. We 
created a storage pool for each switch according to the network topology.In other 
words, when writing to a storage pool, the data is distributed via the stripe count 
and chunk size, but remains within the storage pool and thus on a switch. Figure 11 
shows the write throughput for three scenarios. Each scenario uses a different 
number of core switches with six being the full network capacity. In the first 


Table 1 BeeGFS startup and shutwdown 
Nodes 8 16 32 64 128 256 


Startup (s) 10.2 16.7 29:3 56.5 152.1 222.4 
Shutdown (s) 11.9 12.1 9.4 15.9 36.1 81.0 
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Fig. 10 Solid line: Read/write throughput. Dashed line: extrapolation with the theoretical peak of 
NVMe-SSDs 
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Fig. 11 IoZone write throughput with reduced number of core switches on 240 nodes 


experiment, with all six core switches, there is only a minimal performance loss, 
which indicates a small overhead when using storage pools. In the second case we 
turned off three switches, and in the last case we turned off five switches. With 
reduced number of core switches the write throughput drops due to the reduced 
network capacity. If storage pools are created according to the topology, it is possible 
to achieve the same performance as with all six switches. 
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5.3 Concurrent Data Staging 


We also considered the case of copying data back to the PFS while an application is 
running. For this purpose, we evaluated NAStJA [6] with concurrent data staging. To 
stage the data, during the NAStJA execution, we used the parallel copy tool dcp [59]. 
The configuration for this use-case: 


e We used 24 nodes with 20 cores per node. 

e NAStA was executed on 23 nodes with 20 tasks per node. 

e BeeOND was started on all 24 nodes using the idle node as metadata server. 
¢ Three different scenarios were evaluated during the application execution: 


— without data staging, 

— data staging using every node with one task per node for data staging, 

— data staging using the node, where only the meta-data server is running, with 
4 tasks executed on this node. 


Figure 12 shows the average execution time per time-step of five runs in our different 
scenarios. In the beginning, the slowdown is significant (orange line) due to the high 
amount of metadata operations. In this case, a portion of the data is indexed on every 
node. This indexing is causing interference with the application. When using only 
the MDS-server to copy the data (green line), the indexing is done only on the MDS- 
server. 
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Fig. 12 Average execution time per time-step (5 runs). Without data staging(blue). Concurrent 
data staging using the meta-data node (green) and using every node (orange) 
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6 GekkoFS on NVME Based Storage Systems 


Recently, new storage technologies such as NVME SSDs have been introduced to 
modern HPC systems. To evaluate the GekkoFS file system, for future systems, we 
performed some benchmarks using NVME SSDs. For demonstration we installed 
GekkoFS on the cluster Taurus [67] of TU Dresden. Taurus consists of ca. 47,000 
cores of different architectures. For the demonstration, we use 8 NVME nodes of 
the HPC-DA [28] extension of Taurus. This extension consists of 90 nodes, and a 
single node has 8 Intel SSD DC P4610 Series NVME storage with 3.2 TB capacity 
and a peak bandwidth of 3.2 GB/s. Each node has 2 sockets Intel Xeon E5-2620 
v4 with 32 cores and 64GB main memory. Further, the NVME nodes are equipped 
with two 100 Gbit/s EDR Infiniband links with a peak bandwidth of 25 GB/s each. 
This experiment aims to investigate how well GekkoFS performs on new storage 
architectures. 

We installed GekkoFS on Taurus using the Infiniband network provider. For 
our demonstration, we use 8 NVME nodes in this setup. The nodes are client 
and server in one. We assign one NVME card per node as backing storage to the 
GekkoFS daemon. This results in a distributed file system with a total capacity of 
25.6TB and a theoretical maximum bandwidth of 8 x 3.2GB/s = 25.6GB/s for 
this configuration. To measure the data throughput of GekkoFS and investigate the 
impact of different access patterns to the file system we utilize the IOR benchmark. 

We perform strong scaling tests with 8, 16, 32 and 64 processes writing and 
reading | TB of data. Therefore, we adjust the block size and transfer size for a 
different number of processes. To avoid interference, we pin the IOR processes to 
one socket while the GekkoFS daemon is pinned to the other. Before the creation 
of the GekkoFS file system the NVME devices were cleared, and a new Ext4 file 
system was created as an underlying file system on the block device. We measure 
different access patterns, file per process with sequential and random accesses and 
shared file with sequential access. To avoid measuring cache effects, we flush the 
page, inode and dentry caches of the operating system before each run. 

Figure 13 shows the sequential access pattern. In the figure, one can see that the 
write bandwidth is stable at around 22 GB/s for all runs. The variation is small, and 
the values are near to the peak bandwidth of 25 GB/s for this setup. The suitable 
write bandwidths came from the relatively large transfer sizes of 64 MB to benefit 
from RDMA. For the read bandwidth, we get values between 13 and 17 GB/s. Also 
the read bandwidth first decreases when more processes are used and then increases 
again at the 64 processes. Such a poor read bandwidth is a behavior which could 
not be observed for the other measurements on MOGON II, where read and write 
bandwidth are almost equal, and is certainly a point of further investigation. 

Figure 14 depicts the random access case. The results are similar to the sequential 
access pattern, which was expected because the internal handling of GekkoFS makes 
no difference for these cases. The write bandwidth is stable between 22 and 23 GB/s 
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Fig. 13 IOR on GekkoFS on 8 NVME nodes performing a sequential file per process access 
pattern 


and saturates the NVME SSD quite well. For the read, the achieved bandwidth is 
around 14 GB/s, the values are more stable than for the sequential case, which might 
be some cache effects. 

In Fig. 15 we can see, that even for shared access pattern the results are similar 
to the file per process access pattern. The write bandwidth is again stable at 22 GB/s 
and the read bandwidth is around 16GB/s except for the configuration with 32 
processes where the read bandwidth is lower. This is also similar to the sequential 
file per process configuration in Fig. 13. As a result, we can see that GekkoFS 
can utilize NVME SSDs and is, therefore, ready for the next generation of storage 
systems. We could figure out that the different access patterns make no difference 
for the write bandwidth. For the read bandwidth, there is some bottleneck which 
needs further investigation. At the time of writing, multiple causes are imaginable; 
for example, the network layer for Infiniband might be an issue. This could also 
explain why this problem does not occur for the tests in Mainz because they have 
other network types. 
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Fig. 14 IOR on GekkoFS on 8 NVME nodes performing a random file per process access pattern 
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Fig. 15 IOR on GekkoFS on 8 NVME nodes performing a sequential shared file access pattern 
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7 Conclusion 


The goal of the ADA-FS project was to improve I/O performance for parallel appli- 
cations. Therefore, a distributed burst buffer file system, and several components 
for deployment and data management were developed. The GekkoFS distributed 
burst buffer file system as the central part of the project was presented as a 
scalable and very flexible alternative to handle the challenging I/O patterns of 
scientific applications. Primarily through the innovative metadata management, it 
beats conservative shared parallel file systems for metadata intensive workloads by 
a margin. Thanks to its flexibility, GekkoFS offers the user an exclusive file system 
for his applications and eliminates several bottlenecks caused by the contention of 
a shared resource. In addition, GekkoFS has become a basis in the EU-funded Next 
Generation I/O for Exascale (NEXTGenIO) project where it will be continuously 
and collaboratively developed to support future storage technologies as well, such 
as persistent memory. 

For successful data staging, investigations about the precision of the user- 
provided wallclock time of jobs were made. We could show how to improve 
wallclock estimates by considering the metadata of a job, and show a way to 
integrate the process of deployment and data staging into the job scheduler. Further, 
we present a tool suite to collect information about hardware resources of a compute 
node to support the deployment in a flexible manner. 

Another topic that was not covered here is the analysis of the required POSIX 
semantics of parallel applications. These insights show during the design of the file 
system which operations are required to run scientific workloads. Further, its results 
can help the user to decide for a storage system that fits his needs best. 

The evaluations showed that GekkoFS provides close to linear data and metadata 
scalability up to 512 nodes with tens of millions of metadata operations. Due to the 
decentralized and distributed design, the file system is set to be used in even larger 
environments as exascale environments are in close reach. Even on the latest storage 
infrastructure, GekkoFS can operate out of the box at the peak bandwidth at least 
for write operations. 

Following this project, we plan further improvements on GekkoFS, for example, 
caching offers possibilities to gain even more performance. Another topic that we 
want to keep working on is the integration of GekkoFS into the job schedulers of 
the systems and the workflows of the user. 

Conclusively, the project reached its goals by improving I/O performance of 
parallel applications, especially in the field of metadata intensive workloads where 
traditional parallel file systems are lacking performance. 
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of the scientist working with the code and for utilizing performance provided 
by modern and future architectures. The need for performance optimization is 
motivated by the evolution of the parallel architecture landscape from homogeneous 
flat machines to heterogeneous combinations of processors with deep memory 
hierarchy. Notably, the rise of many-core, throughput-oriented accelerators, such 
as GPUs, requires non-trivial code changes at minimum and, even worse, may 
necessitate a substantial rewrite of the existing codebase. At the same time, the code 
complexity increases the difficulty for computer scientists and vendors to understand 
and optimize the code for a given system. Storing the products of climate predictions 
requires a large storage and archival system which is expensive. Often, scientists 
restrict the number of scientific variables and write interval to keep the costs 
balanced. Compression algorithms can reduce the costs significantly but can also 
increase the scientific yield of simulation runs. In the AIMES project, we addressed 
the key issues of programmability, computational efficiency and I/O limitations 
that are common in next-generation icosahedral earth-system models. The project 
focused on the separation of concerns between domain scientist, computational 
scientists, and computer scientists. The key outcomes of the project described in this 
article are the design of a model-independent Domain-Specific Language (DSL) to 
formulate scientific codes that can then be mapped to architecture specific code and 
the integration of a compression library for lossy compression schemes that allow 
scientists to specify the acceptable level of loss in precision according to various 
metrics. Additional research covered the exploration of third-party DSL solutions 
and the development of joint benchmarks (mini-applications) that represent the 
icosahedral models. The resulting prototypes were run on several architectures at 
different data centers. 


1 Introduction 


The problems on the frontier of science requires extreme computational resources 
and data volumes across the disciplines. Examples of processes include the under- 
standing of the earth mantle [10], plasma fusion [24], properties of steel [5], and the 
simulation of weather and climate. The simulation of weather and climate requires 
to model many physical processes such as the influence of radiation from the sun and 
the transport of air and water in atmosphere and ocean [8]. As these processes are 
complex, scientists from different fields collaborate to develop models for climate 
and weather simulations. 

The mathematical model of such processes is discretized and encoded as 
computer model using numerical methods [53]. Different numerical methods can 
be used to approximate the mathematical models. A range of different numerical 
methods are used, including finite differences, finite volumes, and finite elements. 
All of these methods partition the domain of interest into small regions and apply 
stencil computations to approximate operations such as derivatives. 


AIMES 63 


The necessary computations include variables (fields) like temperature or pres- 
sure distributed spatially over some surface or space—the problem domain. Simple 
techniques divide a surface into rectangular smaller regions covering the whole 
domain. Such rectangular grids have a simple regular structure. Those grids fit 
computations well as the grid structure simply corresponds to the array notation 
of the programming languages. However, applying this grid to the globe leads to 
variable sizes of grid cells, e.g., the equator region has a coarse grid while the polar 
regions are a singularity. With such a shortcoming, rectangular grids are well suited 
for regional models but not for a global model. 

Therefore, recent models targeting global simulations are developed using differ- 
ent grids. Moving to such alternative grids allows to solve the cell area problem for 
global models, but the formulation of the models is more complicated. Icosahedral 
grids are examples of such alternatives. An icosahedral grid results from projecting 
an icosahedron onto the surface of the globe. The surface of the globe is then 
divided into twenty spherical triangles with equal areas. Grid refinement is achieved 
with recursive division of the spherical arcs into halves. The resulting points of the 
division form four smaller spherical triangles within each spherical triangle. Such 
refinement is repeated until the needed resolution is reached. Icosahedral grids can 
be used with the triangles as the basic cell, but also hexagons can be synthesized. 

Icosahedral grids have approximately uniform cell area and can be used for 
global models avoiding the cell area differences in contrast to the rectangular grids. 
However, complications arise when thinking of the technical side, where we need 
to know how to map the field data into data structures. Such technical details are 
challenging with the performance demand for the models. 

The values of a field in the simulation is localized with respect to the grid 
cell depending on the numerical formulation. In one method, values of a field 
are localized at the centers of the cells—this can be a single value or multiple 
values with higher order methods. However, other methods localize values on the 
vertices, while others reside on the edges separating the cells (see Fig. 1). How the 
cells are connected to each other, i.e., the neighbors and orientation of the cells, 
is defined in the connectivity. A problem domain can be organized in a regular 


cell 
edge 
vertex 


a) b) 


Fig. 1 Icosahedral grids and variables. (a) Triangular grid. (b) Hexagonal grid 
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fashion into so-called structured grids—following an easy schema to identify (left, 
right, ...) neighbors. Unstructured grids can follow a complex connectivity, e.g., 
using smaller and larger cells, or covering complex surfaces but require to store the 
connectivity information explicitly. The modern models explore both schemes due 
to their benefit; for instance, the structured grid can utilize compiler optimizations 
more effectively, while unstructured grids allow local refinement around areas of 
interest. 

General-purpose languages (GPL), e.g., Fortran, are widely used to encode the 
discretized computer model. The simulation of the earth with a high resolution like 
1 km? that is necessary to cover many smaller-scale physical processes, requires a 
huge computation effort and likewise storage capacity to preserve the results for 
later analysis. Due to uncertainty, scientists run a single experiment many times, 
multiplying the demand for compute and storage resources. Thus, the optimization 
of the codes for different architectures and efficiency is of prime importance to 
enable the simulations. 

With existing solutions, scientists rewrite some code sections repeatedly with dif- 
ferent optimization techniques that utilize the capabilities of the different machines. 
Hence, scientists must learn different optimization techniques for different archi- 
tectures. The code duplication brings new issues and complexities concerning the 
development and the maintainability of the code. 

Thus, the effort from the maintainers and developers of the models, who are nor- 
mally scientists and not computer scientists, is substantial. Scientists’ productivity 
is an important point to consider as they do activities that should not be their focus. 
Maintaining model codes throughout the lifecycle of the model is a demanding 
effort under all the mentioned challenges. 

The structure of the icosahedral grids brings complications not only to the 
computation, but also to the storage of the field data. In contrast to regular grids, 
where multi-dimensional array notation fits to hold the data, icosahedral grids do 
not necessarily map directly to simple data structures. Besides the challenge of 
file format support, modern models generate large amounts of data that impose 
pressure on storage systems. Recent models are developed with higher-resolution 
grids, and include more fields and processes. Simulations writing terabytes of data 
to the storage system push towards optimizing the use of the storage by applying 
data-reduction techniques. 

The development of simulation models unfolds many challenges for the scientific 
community. Relevant challenges for this project are: 


¢ Long life: The lifecycle of earth system models is long in comparison to the 
turnover of the computer technology mainly in terms of processor architectures. 

e Performance and efficiency: The need for performance and the optimal use of 
the hardware resources is an important issue. 

e Performance-portability: Models are run on different machines and on different 
architectures. They must use the available capabilities effectively. 
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Collaboration: Another point is the collaborative efforts to develop models— 
involving PhD students to contribute pieces of science code to a large software 
project—that complicates the maintenance and software engineering of the 
models. 

Data volume: the large amounts of data must be handled efficiently. 


1.1 The AIMES Project 


To address challenges facing earth system modeling, especially for icosahedral 
models, the project Advanced Computation and I/O Methods for Earth-System 
Simulations (AIMES) investigated approaches to mitigate the aforementioned 
programming and storage challenges. 

The AIMES project is part of the SPPEXA program and consisted of the 
consortium: 


Universität Hamburg, Germany 

Institut Pierre Simon Laplace (IPSL), Université Versailles Saint-Quentin-en- 
Yvelines, France 

RIKEN Advanced Institute for Computational Science, Japan 

Global Scientific Information and Computing Center, Tokyo Institute of Technol- 
ogy, Japan 


The project started in March 2016 with plans for 3 years. 

The main objectives of the project were: (1) enhance programmability; (2) 
increase storage efficiency; (3) provide a common benchmark for icosahedral 
models. The project was organized in three work packages covering these aspects 
and a supplementary project management work package to achieve the three project 
objectives. The strategy of the work packages is layed out in the following. 

Under the first work package, higher-level scientific concepts are used to develop 
a dialect for each of three icosahedral models: DYNAMICO [17], ICON [54], 
and NICAM [47]. A domain-specific language (DSL) is the result of finding 
commonalities in the three dialects. Also, a light weight source-to-source translation 
tool is developed. Targeting different architectures to run the high-level code is 
an important aspect of the translation process. Optimizations include applying 
parallelization to the different architectures. Also, providing a memory layout that 
fits the different architectures is considered. 

Under the second work package, data formats for icosahedral models are 
investigated to deal with the I/O limitations. Lossy compression methods are 
developed to increase storage efficiency for icosahedral models. The compression 
is guided with user-provided configuration to allow to use suitable compression 
according to the required data properties. 

Under the third work package, relevant kernels are selected from the three 
icosahedral models. A mini-IGCM is developed based on each of the three models 
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to offer a benchmark for icosahedral models. Developed code is used to evaluate the 
DSL and the compression of icosahedral global modeling. 

The key outcomes of the project described in this article are: (1) the design 
of an effective light-weight DSL that is able to abstract the scientific formulation 
from architecture-specific capabilities; (2) the development of a compression library 
that separates the specification of various qualities that define the tolerable error 
of data from the implementation. We provide some further large-scale results of 
our compression library for large scale runs extending our papers about the library 
[36, 37]; (3) the development of benchmarks for the icosahedral models that are 
mini-applications of the models. 

Various additional contributions were made that are summarized briefly with 
citations to the respective papers: 


e We researched the impact of lossless data compression on energy consumption in 
[2]. In general, the energy consumption increases with the computational intensity 
of the compression algorithm. However, there are algorithms that are efficient and 
less computational intense that can improve the energy-efficiency. 

e We researched compilation time of code that is generated from the DSL using 
alternative optimization options on different compilers [20]. Different optimiza- 
tion options for different files allow different levels of performance, however, 
compilation time is also an important point to consider. Results show that 
some files need less optimization focus while others need further care. Small 
performance drops are measured with considerable reduction in compile times 
when the suitable compilation options are chosen. 

e We researched annotating code for instrumentation automatically by our transla- 
tion tool, to identify resource consuming kernels [21]. Instrumentation allows to 
better find where to focus the optimization efforts. We used the DSL translation 
tool to annotate kernels and make generated code ready for instrumentation. As 
a result performance measurements were recorded with reduced effort as manual 
preparations are not needed anymore. 

e We researched applying vector folding to icosahedral grid codes in a bachelor 
thesis [49]. Vector folding allows to improve use of caches by structuring data in 
a way accounting for caches and data dimensionality. Results show that vector 
folding was difficult to apply manually to icosahedral grids. Performance was 
raised but not significantly as a result of the needed effort that should be invested 
to rewrite kernels with this kind of optimization. 

e We involved ASUCA and the use of Hybrid Fortran [43] to port original CPU 
code to GPUs, to look at a different model with different requirements. 


This article is structured as follows: first, the scope of the state-of-the-art and 
related work is sketched in Sect. 2. In Sect. 3, an alternative development approach 
for code-modernization is introduced. Various experiments to evaluate the benefit of 
the approach are shown in Sect. 4. The compression strategy is described in Sect. 5 
and evaluated in Sect. 6. The benchmarks for the icosahedral models are discussed 
in Sect. 7. Finally, the article is concluded in Sect. 8. 
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2 Related Work 


The related work covers research closely related to domain-specific languages in 
climate and weather and the scientific data compression. 


2.1 Domain-Specific Languages 


DSLs represent an important approach to provide performance portability and 
support model development. A DSL is always developed having a particular domain 
in mind. Some approaches support multiple layers of abstraction. A high-level 
abstraction for the finite element method is provided with Firedrake [44]. The 
ExaStencils pipeline generally addresses stencil codes and their operations [18, 33] 
and many research works introduce sophisticated schemes for the optimization of 
stencils [7, 9]. 

One of the first DSLs which were developed to support atmospheric modeling 
is Atmol [52]. Atmol provided a DSL to allow scientists to describe their models 
using partial differential equations operators. Later, Liszt [14] provided a DSL for 
constructing mesh-based PDE solvers targeting heterogeneous systems. 

Multi-target support was also provided by Physis [42]. Physis is a C-based DSL 
which allows developing models using structured grids. 

Another form of DSL is Icon DSL [50]. Icon DSL was developed to apply index 
interchanges based on described swapping on Fortran-based models. 

Further work based on C++ constructs and generic programming to improve 
performance portability is Stella [23] and later GridTools [13]. Computations are 
specified with a C++-based DSL and the tools generate code for CPUs or GPUs. 
GridTools are used to port some kernels from the NICAM model in our project 
AIMES to evaluate existing DSLs. 

Although C++ provides strong features through generic programming allowing 
to avoid performance portability issues, scientists are reluctant to utilize alternative 
programming languages as the existing codes are huge. Normally scientists prefer 
to keep using preferred languages, e.g. Fortran, rather than moving to learning C++ 
features. 

Other forms of DSLs used directives to drive code porting or optimization. 
Hybrid Fortran [43], HMPP [16], Mint [51], CLAW [12] are examples of directive- 
based approach. Such solutions allow adding directives to code to guide some 
optimization. Scientists write code in some form and add directives that allow tools 
to provide specific features, e.g., CLAW allows writing code for one column and 
allows using directives to apply simulations over the set of columns in parallel. 

In the MetOffice’s LFRic model [1], a DSL is embedded into the Fortran code 
that provides an abstraction level suitable for the model. The model ships with 
the PSyclone code-generator that is able to transform the code for different target 
platforms. In contrast to our lightweight solution, these DSLs are statically defined 
and require a big translation layer. 
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2.2 Compression 


Data-reduction techniques related to our work can be structured into: (1) algorithms 
for the lossless data compression; (2) lossy algorithms designed for scientific 
(floating-point) data and the HPC environment; (3) methods to identify necessary 
data precision and for large-scale evaluation. 


Lossless Algorithms There exist various lossless algorithms and tools, for exam- 
ple, the LZ77 [55] algorithm which utilizes dictionaries. By using sliding windows, 
it scans uncompressed data for two largest windows containing the same data 
and replaces the second occurrence with a pointer to the compressed data of the 
first. The different lossless algorithms vary in their performance characteristics and 
ability to compress data depending on its characteristics. A key limitation is that 
users have to pick an algorithm depending on the use case. In [25], we presented 
compression results for the analysis of typical climate data. Within that work, the 
lossless compression scheme MAFISC with preconditioners was introduced. With 
MAFISC, we also explored the automatic selection of algorithms by compressing 
each block with two algorithms, the best compression chain so far and one randomly 
chosen. It compresses data 10% more than the second best algorithm (e.g., standard 
compression tools). 


Lossy Algorithms for Floating-Point Data SZ[15] and ZFP [41] are the de- 
facto standard for compressing floating-point data in lossy mode. Both provide a 
way of bounding the error (either bit precision or absolute error quantities) but 
only one quantity can be selected at a time. ZFP [41] can be applied up to three 
dimensions. SZ is a newer and effective HPC data compression method; it uses 
a predictor and the lossless compression algorithm GZIP. Its compression ratio is 
typically significantly better than the second-best solution of ZFP. In [26], two lossy 
compression algorithms (GRIB2, APAX) were evaluated regarding the loss of data 
precision, compression ratio, and processing time on synthetic and climate dataset. 
These two algorithms have equivalent compression ratios and depending on the 
dataset APAX signal quality may exceed GRIB2 and vice versa. 


Methods The application of lossy techniques to scientific (floating-point) datasets 
is discussed in [11, 22, 27, 38-40]. A statistical method to predict characteristics 
(such as proportions of file types and compression ratio) of stored data based on 
representative samples was introduced in [34] and the corresponding tool in [35]. It 
can be used to determine compression ratio by scanning a fraction of the data, thus 
reducing costs. 

Efforts for determination of appropriate levels of precision for lossy compression 
methods for climate and weather data were presented in [3] and in [4]. The basic 
idea is to compare the statistics derived from the data before and after applying lossy 
compression schemes; if the scientific conclusions drawn from the data are similar 
and indistinguishable without the compression, the loss of precision is acceptable. 
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Computations in earth system modeling run hundreds or thousands of stencil 
computations over wide grids with huge numbers of points. Such computations are 
time consuming and are sensitive to optimal use of computer resources. Compilers 
usually apply a set of optimizations while compiling code. Semantical rules of 
the general purpose language (GPL) can be applied to the source code and used 
within compilers to translate the code to semantically equivalent code that runs 
more efficiently. However, often the semantical information extracted from the 
source code are not enough to apply all relevant optimizations as some high- 
level optimizations would alter the semantics—and the rules of the GPL forbid 
such changes. As mostly code from GPLs is at a lower semantical level than the 
abstraction level of the developers, opportunity of optimization is lost. Such lost 
opportunities are a main obstacle to develop software in a performance-portable 
way in earth system modeling. 

To address the lost opportunities of optimization, different techniques are applied 
by the scientists directly to the source code. Thus, it is the responsibility of 
the scientists who develop the models to write the code with those decisions 
and guidelines on optimizations in mind. Drawbacks of this strategy include 
pushing scientists to focus on machine details and optimal code design to use 
hardware resources. Scientists need to learn the relevant optimization techniques 
from computer science such as, e.g., cache blocking. 


3.1 Our Approach 


To solve the issues with the manual code optimization, we suggest using an addi- 
tional set of language constructs which exhibit higher-level semantics. This way, 
tools can be used to apply optimizations based on those semantics. Optimization 
responsibilities are moved again from scientists to tools. 

As usually, the source code is developed by scientists, however, in our approach, 
instead of coding on low-level and caring for optimization strategies, a DSL is used 
as abstraction. Machine-specific or computer scientific concepts are not needed to 
write the source code of a model. This is enabled by increasing the abstraction level 
of a GPL by providing a language extension with semantics based on the scientific 
concepts from the domain science itself. 

The DSL implements a template mechanism to simplify and abstract the code. 
For the purpose of this project, it was designed to abstract climate and weather 
scientific concepts but other domains and models could be supported. Therefore, 
stencils and grids are used as the basis in the DSL. The language extensions hide 
the memory access and array notations. They also hide the details of applying the 
stencils and the traversal of the grids. The grid structure itself is hidden in the source 
code. A model’s code uses the grid without specifying the locations of the grid 
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points in memory or how neighbors are accessed. All such details are specified in 
the configuration files. Lower details are neglected at the DSL level. 

Translation tools handle the processing of the higher-level source code and 
converting it to compatible GPL code. The semantics of the added language exten- 
sions are extracted from the source code and are used to apply further high-level 
transformations to the code. Applied transformations are guided by configuration 
files that allow users to control the optimization process and may convert the 
code to various back-end representations that, in-turn, can be converted to code 
that is executed on a machine. A key benefit of this strategy is that it increases 
the performance-portability: A configuration can apply optimization techniques 
exploiting features of a specific architecture. Therefore, a single scientific code base 
can be used to target multiple different machines with different architectures. 

Model-specific configuration files are provided separately to guide the code 
translation and optimization process. Those files are developed by scientific pro- 
grammers rather than domain scientists. In contrast to domain scientists, the 
scientific programmers must have an intermediate understanding of the scientific 
domain but also understand the hardware architecture of a target machine. They use 
their experience to generate code that uses the machine’s resources, and write the 
configurations that serve the purpose of optimal use of that specific machine to run 
the selected model. 

Our approach offers separation of concerns between the parties. Scientific work 
is done by scientists and optimization is done by scientific programmers. The 
concept of the approach is illustrated in Fig. 2. For the icosahedral models, a single 
intermediate domain language is derived that can be adjusted for the needs of each 
model individually (dialects). From this single source various code representations 
(back-ends) could be generated according to configuration, e.g., MPI+OpenMP or 
GASPI. 


Domain science Computer science 


Meta-DSL 


compilers 


DSL tools & 
~ infrastructure 


| GRIDTools: 
———" 


h, pa 
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back-ends existing tools 


Fig. 2 Separation of concerns 
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3.2 Extending Modeling Language 


An important point we consider in our approach is keeping the general-purpose 
language, e.g. Fortran, that the scientists prefer to use to develop their model. 
We add additional language constructs to the GPL. This simplifies the mission 
to port existing models which could include hundreds of thousands of lines of 
code. The majority of the code is kept, while porting some parts incrementally; 
by providing templates in the configuration files, code can be simplified while it 
still produces equivalent GPL constructs: Changes are replacements of loop details 
and field access into alternative form using the added extensions. A drawback 
of the incremental approach is that the full beauty of optimizations like memory 
adjustments requires to have ported the complete model. 

Our plan to develop such language extensions was to start with the three existing 
modern icosahedral models of the project partners. In the first phase, special dialects 
were proposed to support each model. Then, we identified common concepts and 
defined a set of language extensions that support the domain with domain-specific 
language constructs. We collected requirements, and worked in collaboration with 
scientists from the three models to reach at the language extensions, in detail: 


1. The domain scientists suggested compute-intensive and time-consuming code 
parts. 

2. We analyzed the chosen code parts to find out possibilities to use scientific terms 

instead of existing code. We always kept in mind that finding a common repre- 

sentation across the three models leads to domain-specific language extensions. 

We replaced codes with suggested extensions. 

4. We discussed the suggestions with the scientists. Discussions and improvements 
were done iteratively. The result of those discussions lead to the GGDML 
language extensions. 


Ow 


3.2.1 Extensions and Domain-Specific Concepts 


The General Grid Definition and Manipulation Language (GGDML) language 
extensions provide a set of constructs for: 


e Grid definition 

e Declaration of fields over grids 
e Field access and update 

e Grid traversal to apply stencils 
e Stencil reduction expressions 


GGDML code provides an abstraction including the order of the computation of 
elementary operators. Therefore, optimizations can result in minor changes of the 
computed result of floating-point operations; this is intentional as bit-reproducibility 
constraints the optimization potential. 
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In code that is written with GGDML, scientists can specify the name of the 
grid that should be traversed when applying a stencil. The definitions of the grids 
are provided globally through the configuration files. GGDML allows to specify 
a modified set of grid points rather than the whole set of grid points as provided 
through the grid definition grid, e.g., traversing a specific vertical level. Such 
possibilities are offered through naming a grid and using operators that allow 
adding, dropping, or modifying dimensions of that grid. Operators could change 
the dimensionality of the grid or override the existing dimensions. 

Fields are declared over different grids through declaration specifiers. GGDML 
provides a flexible solution to support application requirements. A basic set of 
declaration specifiers allows to control the dimensionality of the grid, and the 
localization of the field with respect to the grid. Such declaration specifiers allow 
applications to deal with surfaces and spaces, and also supports using staggered as 
well as collocated grids. 

Access specifiers provide tools the necessary information that will be used to 
allocate/deallocate and access the fields. Field access is an important part of stencil 
operations. GGDML provides an iterator statement to apply the stencil operations 
over a set of grid points. The GGDML iterator statement replaces loops and the 
necessary optimization to apply stencils. It provides the user an index that refers to 
the current grid point. Using this index, scientists can write their stencils without 
the need to deal with the actual data structures that hold the field data. The iterator 
applies the body to each grid point that is specified in the grid expression, which is 
one part of the iterator statement. This expression is composed from the name of a 
grid, and possibly a set of modifications using operators as mentioned above. 

The iterator’s index alone is not sufficient to write stencil operations, as stencils 
include access to neighboring points. For this purpose, GGDML uses access 
operators, which represent the spatial relationships between the grid points. This 
allows to access the fields that need to be read or written within a stencil operation 
using spatial terms instead of arrays and memory addresses. To support different 
kinds of grids, GGDML allows users to define those access operators according to 
the application needs. 

Repetitions of the same mathematical expressions over different neighbors is 
common in stencil operations. To simplify writing stencils, GGDML provides a 
reduction expression. Reduction expressions apply a given sub-expression over 
multiple neighbors along with a mathematical operator applied to the set of the 
subexpressions. 


3.3 Code Example 


To demonstrate the code written with extensions, a sample code from the NICAM 
model written with Fortran is given in Listing 1. As we can see from the original 
NICAM code, a pattern is repeated in the code: the same field is accessed multiple 
times over multiple indices. Optimization is limited as firstly, the memory layout 
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is hardcoded in the fields cgrad and scl, secondly the iteration order is fixed. 
Integrating blocking for cache optimization in this schema would increase the 
complexity further. 


Listing 1 NICAM Fortran code 


do d = 1, ADM _nxyz 
do 1 = 1, ADM _lall 
!OCL PARALLEL 
! support indices to address neighbors 
do k = 1, ADM kall 
do n = OPRT_nstart, OPRT_nend 


ae] = jal 
ADL) Som 4p al, 
ijpl = ig + ADM gall_i1d 
ipljpl = n + 1 + ADM gall_i1d 
im1j = fl = al 
ijml1 =n = ADM gall_i1d 
imljml = n - 1 - ADM gall _i1d 
grad(n,k,1,d) = cgrad(n,1,0,d) *« seci pis, ily) E 
a> (elepaeyel (il IL al,tel) 3 ee PRY Se 
+ cgrad(n,1,2,d) * scl(ipljp1,k,1) & 
db Ceai (ll, 3,6) 2 sell (Glsiol pe D & 
+ cgrad(n,1,4,d) * scl(imlj ,k,1) & 
+ cgrad(n,1,5,d) *« scl(imljml1,k,1) & 
+ cgrad(n,1,6,d) * scl(ijml_ ,k,1) 
enddo 
grad ( AL ROWDY saksicehae Il jie, dk, Cl) = ORERE 
grad (OPRT_nend+1:ADM gall psp lne = OO R 
enddo 
enddo 


enddo 


The same semantics rewritten with the DSL is shown in Listing 2. Instead of 
iterating across the grid explicitly, a FOREACH loop specifies to run on each 
element of the grid, the coordinates are encoded in the new cell variable. We reduced 
the repeated occurrences of the fields with the indices with a ‘REDUCE’ expression. 
The Fortran indices are replaced with DSL indices that made it possible to simplify 
the field access expressions. 


Listing 2 NICAM DSL code 


FOREACH cell in grid | g{OPRT_nstart..OPRT_nend} 
do d = 1, ADM _nxyz 


grad(cell,d) = REDUCE(+,N={0..6}, 
ra lole teers hat * scl (cell%neighbor (N) ) 
Ss ) 


enddo 
END FOREACH 
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FOREACH cell in GRID%cells | g{1..OPRT_nstart-1 , OPRT_nend+1 


+ .. gall} 
do d = 1, ADM _nxyz 
grad(cell,d) = 0.0 PRECISION 
enddo 


END FOREACH 


3.4 Workflow and Tool Design 


Model code that is based on the DSL along with code of the model in general- 
purpose language goes through a source-to-source translation process. This step is 
essential to make the higher-level code ready for processing by the compilers. Our 
tool is designed in a modular architecture. By applying a configuration, it translates 
code in a file or a source tree and generates a version of the transformed code. 
The generated code is the optimized version for a specific machine/architecture 
according to the configuration. Inter-file optimizations in code trees can also be 
detected and applied. 

The tool is implemented with Python3. Users call the main module with 
parameters that allow them to control the translation process, e.g. to specify a 
language module. The code tree is provided to the main module also as an argument. 

The main module loads the other necessary modules. The DSL handler module 
constructs the necessary data structures according to the user-provided configuration 
file. The source code tree is then parsed into abstract syntax trees (AST). The 
generated ASTs can be analyzed for optimization among the source files. After all 
the optimizations/transformations are applied, the resulting code tree is serialized to 
the output file. Figure 3 shows the design of the translation process. 


S2S 
translation 


Code DS 


Language 
Handlers 


Translation 
Configuration 


Fig. 3 Translation process. Yellow components are influenced by the user options 
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3.5 Configuration 


Configuration files include multiple sections, among which some are essential and 
others can be added only if needed. Optimization procedures are driven by those 
configuration sections. The translation tool uses defaults in case optional sections 
are missing. 


Blocking Among the important optimizations that help improve the performance 
of stencil computations is the optimal use of the caches and memory access. Reusing 
the data in the caches eliminates the need to read the same data elements repeatedly 
from memory. Often, data locality can be exploited in stencil computations, allowing 
for performance improvements. 


Cache blocking is a technique to improve the data reuse in caches. Our 
translation process can apply cache blocking based on the scientific programmer’s 
recommendations. One configuration section is used to allow to specify cache 
blocking information. The default when this section is missing in a configuration 
file is to not apply cache blocking. 

An example kernel using GGDML in the C programming language is shown in 
Listing 3. 


Listing 3 Example kernel using C with GGDML 


foreach e in grid 


float df=(£_Flc.east_edge()]-f£ Flc.west_edge()])/dx; 
float dg=(f£ G[c.north edge ()]-£ G[c.south_edge()]) /dy; 
£ HT [c] =df+dg; 


} 


Applying cache blocking using a configuration file defining 10,000 elements per 
block generates the loop code shown in Listing 4. The first loop handles completely 
occupied blocks with 10k elements and the second loop the remainder. In both 
cases, the loop body (not shown) contains the generated stencil code. 


Listing 4 Example loop structure for blocking 


for (size_t blk start = (0); blk start < (GRIDX); blk_start 
<> +=10000) { 
size t blk end = GRIDX; 
if ((blk_end - blk_start) > 10000) 
blk_end = blk_start + 10000; 
// Generated loop body 
} 
#pragma omp simd 
for (size_t XD index = blk start; XD_index < blk _end; 
<> XD_index++) { 
// Generated loop body 


} 


76 J. Kunkel et al. 


Memory Layout Another important point to optimize memory bandwidth is to 
optimize the data layout in memory. The temporal and spacial locality of data 
should lead to access of data in complete cache lines such that it can be prefetched 
and cached effectively. Thus, data that is accessed in short time frames should be 
stored closer in memory. To exploit such possibilities, our translation tool provides 
a flexible layout transformation procedure. The DSL itself abstracts from data 
placement, however, the translation process generates the actual data accesses. This 
layout specification is described in configuration files. 


Besides to data layout control, the loop nests that access the field data are also 
subject to user control. The order of the loops that form a nested loop is critical for 
the optimal data access. Loop order of loops that apply the stencils is also controlled 
by configuration files. 

Listing 5 illustrates the resulting code after using a data layout transformation. In 
this case, a 2D grid is stored in a single-dimensional array. 


Listing 5 Example code generated with index transformation 


PE 
#pragma omp for 
for (size t YD index = (0); YD index < (llocal Y Cregion); 
<> YD_index++) { 
#pragma omp simd 
for (size_t XD index = blk start; XD_index < blk _ end; 
<> XD_index++) { 
float d= = (£_F[(YD_index + 1) »* (GRIDX + 3) + (XD index 
œ + 1)+1] 
= £ F[(YD_index + 1) * (GRIDX + 3) + (XD_index) + 1]) 
<> x» invdx; 
float dg = (£_ G[((YD_index + 1)+1) * (GRIDX + 3) + ( 
<> XD_ index) +1] 
- £ G[(YD_index + 1) * (GRIDX + 3) + (XD index) + 1]) 
> 44 alighixohiep 
m HT[(YD_index + 1) * (GRIDX + 3) + (XD_index) + 1] = df 
œ> + dg; 


Inter-Kernel Optimization Cache blocking and memory layout allow improving 
the use of the caches and memory bandwidth at the level of the kernel. However, 
the optimal code at the kernel level does not yet guarantee optimal use of caches 
and memory bandwidth at the application level. Consider the example where two 
kernels share most of their input data but compute different outputs independently 
from each other. These kernels could be fused together and benefit from reusing 
cached data. Note that the benefit is system specific, as the size of the cache and 
application kernel determine the optimal size for blocking and fusion. 
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The inter-kernel optimization allows exploiting such data reuse across kernels. 
To exploit such potential, our translation tool can run an optimizer procedure to 
detect such opportunities and to apply them according to user decision of whether 
to apply any of those optimizations or not. 

Therefore, the tool analyzes the calls among the code files within the code 
tree. This analysis leads to a list of call inlining possibilities. The inlining could 
lead to further optimization through loop fusions. The tool runs automatic analysis 
including data dependency and code consistency. This analysis detects possible 
loop fusions that still keep computations consistent. Such loop fusion may lead 
to optimized use of memory bandwidth and caches. We tested the technique 
experimentally (refer to Sect.4.5) to merge kernels in the application. We could 
improve the use of caches and hence the performance of the application with 30- 
48% on different architectures. 

The tool provides a list of possible inlining and loop fusion cases. Users choose 
from the list which case to apply—we anticipate that scientific programmers will 
make an informed choice for a target platform based on performance analysis tools. 
According to the choice that the user makes, the tool applies the corresponding 
transformations automatically. 

Listing 6 shows two kernels to compute the two components of the flux. 


Listing 6 Example code with two kernels to compute flux components 


eee) 
#pragma omp parallel for 
for(size_ t YD index = (0); YD_index < (local_Y Eregion) ; 
<> YD_index++) { 
#pragma omp simd 
for (size t XD index = blk start; XD_index < blk end; 
XD_index++) { 
f£ F[YD_index] [XD_index] = f U[YD_index] [XD_index] + 
(ERH IYDeandesc| (EXD iandes| i r ERYR mde kE mdex E MD 
<> /2.0; 


} 
[eel 
#pragma omp parallel for 
Hom (sizemt wD hees = (0) wd mees < (oce w Ineeian) s 
<> YD_index++) { 
#pragma omp simd 
for (size t XD index = blk start; XD_index < blk end; 
<> XD_index++) { 
£ G[YD_index] [XD_index] = f£ V[YD_ index] [KD_index] + 
(£ H[YD_index] [(XD_index] + f_H[YD_index -1] [XD_index]) 
<> /2.0; 
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Listing 7 shows the resulting code when the two kernels are merged. 


Listing 7 Merged version of the flux computation kernels 


ee 
#pragma omp parallel for 
for (size t YD index = 0; YD index < (local_Y Eregion) ; 
<> YD_index++) { 
#pragma omp simd 
for (size_t XD index = blk start; XD_index < blk _end; 
<> XD_index++) { 
£ F[YD_index] [XD_index] = f£ _U[YD_index] [XD_index] « 
(£ H[YD_index] [XD_index] + f£ H[YD index] [(XD_index) + 
=> (=i) 203 


G[YD_index] [XD_index] = f£ V[YD_index] [XD index] x 
H[YD_index] [(XD_index] + f_H[(YD_index) + (-1)][ 
<> XD index])/2.0; 


E 
(£ 


Utilizing Distributed Memory Beyond parallelization on the node resources, our 
techniques allow scaling the same source code that uses GGDML over multiple 
nodes utilizing distributed memory. This is essential to run modern models on 
modern supercomputer machines. 


The GGDML code is unaware of underlying hardware, and does not need to 
be modified to run on multiple nodes. Rather, configuration files are prepared to 
translate the GGDML code into code that is ready to be run on multiple nodes. 
Configuration files allow domain decomposition to distribute the data and the 
computation over the nodes. Necessary communication of halo regions is also 
enabled through configuration files. Scientific programmers can generate simple 
parallelization schemes, e.g., MPI using blocking communication or sophisticated 
alternatives like non-blocking communication. When using non-blocking communi- 
cation, a further optimization is to decouple the computation of the inner region that 
can be calculated without needing updated halo data and outer regions that require 
data from another process to be computed. 

The translation tool extracts neighborhood information from the GGDML 
extensions. Such extracted information is analyzed by the tool to decide which 
halo regions should be exchanged between which nodes. Decisions and information 
from configuration files allow to generate the necessary code that handles the 
communication and the synchronization. This guarantees that the necessary data 
are consistent on the node where the computation takes place. 

The parallelization is a flexible technique. No single library is used to handle the 
parallelization, rather, the communication library is provided through configuration 
files. Thus, different libraries or library versions can be used for this purpose. We 
have examined the use of MPI and GASPI as libraries for parallelization. 
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Listing 8 shows the resulting communication code of halo regions between 
multiple processes—in this case without decoupling of inner and outer area, 
code with decoupled areas is longer. In this listing, dirty flags are generated to 
communicate only necessary data. Flags are set global to all processes, and can 
be checked by each process such that processes that need to do communication can 
make use of them. This way we guarantee to handle communication properly. 


Listing 8 Example generated code to handle communication of halo data 


Ml 
//part of the halo exchange code 
ie (E @ chey clec (fala) == 1) y 
if (mpi_world_size = 1) { 
comm _tag++; 


int pp = mpi rank != 0 mpi rank - 1 ; mpi world size -— 
coe ale 

int np = mpi_rank != mpi_world_size - 1 ? mpi_rank + 1 
e> 0; 


MPI_Isend(f_G[0], GRIDX + 1, MPI_FLOAT, pp, 
comm _ tag, MPI_COMM WORLD, &mpi_requests [0]); 
MPI_Irecv(f£ G[local_Y Eregion], GRIDX + 1, MPI_FLOAT, np, 
comm tag, MPI_COMM WORLD, &mpi_requests [1]); 
MPI Waitall(2, mpi_requests, MPI_STATUSES IGNORE) ; 
lee) 


#pragma omp parallel for 
for(size t YD index = (0); YD index < (local Y Cregion) ; 
<> YD_index++) { 
#pragma omp simd 
for (size t XD index = blk start; XD_index < blk _ end; 
<> XD_index++) { 


iBilyoyelic hE = (i EAD aaee] EED inses) ce (al) || 
- £ F[YD_ index] [XD_index]) * invdx; 
float dg = (£_G[(YD_index) + (1)] [XD_index] 


- £ G[YD_index] [XD_index]) * invdy; 
f£ HT[YD_ index] [KD_index] = df + dg; 
} 


} 
evel 


3.6 Estimating DSL Impact on Code Quality and Development 
Costs 


To estimate the impact of using the DSL on the quality of the code and the 
costs of model development, we took two relevant kernels from each of the three 
icosahedral models, and analyzed the achieved code reduction in terms of lines 
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Fig. 4 GGDML impact on 
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of code (LOC) [32]. We rewrote the kernels (originally written in Fortran) using 
GGDML + Fortran. Results are shown in Fig. 4. 

The average reduction in terms of LOC is 70%, i.e. LOC in GGDML-+Fortran 
in comparison to original Fortran code is 30%. More reduction is noticed in some 
stencils (NICAM example No.2, reduced to 12%). 

Influence on readability and maintainability: Using COCOMO [6] as a model 
to estimate complexity of development effort and costs, we estimated in Table 1 
the benefits as a result of the code reductions when applying GGDML to develop a 
model comparable to the ICON model. The table shows the effort in person month, 
development time and average number of people (rounded) for three development 
modes: the embedded model is typically for large project teams a big and complex 
code base, the organic model for small code and the semi-detached mode for in- 
between. We assume the semi-detached model is appropriate but as COCOMO 
was developed for industry projects, we don’t want to restrict the development 
model. The estimations are based on a code with 400KLOC, where 300KLOC of 
the code are the scientific portion that allows for code reduction while 100KLOC 
are infrastructure. 

From the predicted developed effort, it is apparent that the code reductions 
would be leading to a significant effort and cost reduction that would justify the 
development and investment in DSL concepts and tools. 


4 Evaluating Performance of our DSL 


To illustrate the performance benefits of using the DSL for modeling, we present 
some performance measurements that were measured for example codes written 
with the DSL and translated considering different optimization aspects (configura- 
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tions). Two different testcodes were used to evaluate the DSL’s support for different 
types of grids: One application uses an unstructured triangular grid, and the other 
uses a structured rectangular grid that could be applied for code in cubic sphere. 
Both were written using GGDML (our DSL) in addition to C as the host modeling 
language. 


4.1 Test Applications 


Laplacian Solver The first application code uses an unstructured triangular grid 
covering the surface of the globe. The application was used in the experiments 
to apply the Laplacian operator of a field at the cell centers based on field values 
at neighboring cells. Generally, this code includes fields that are localized at the 
cell centers, and on the edges of the cells. The horizontal grid of the globe surface 
is mapped to a one dimensional array using Hilbert space-filling-curve. We used 
1,048,576 grid points (and more points over multiple-node runs) to discretize the 
surface of the globe. The code is written with 64 vertical levels. The surface is 
divided into blocks. The kernels are organized into components, each of which 
resembles a scientific process. 


Shallow Water Model The other code is a shallow water equation solver. It is 
developed with a structured grid. Structured grids are also important to study 
for icosahedral modeling, as some icosahedral grids can be structured. Fields are 
located at centers of cells and on edges between cells. This solver uses the finite 
difference method. The test code is available online.! As part of the testing, we 
investigate performance portability of code developed using the DSL. 


4.2 Test Systems 


The experiments were executed in different times during the course of the project 
and used different machines based on availability and architectural features. 


e Mistral 
The German Climate Computing Center provides nodes with Intel(R) Xeon(R) 
E5-2695 v4 (Broadwell) @ 2.1 GHz processors. 

e Piz Daint 
The Swiss supercomputer provides nodes equipped with two Intel(R) Xeon(R) 
CPU E5-2690 v3 @ 2.60 GHz processors and NVIDIA(R) Tesla(R) P100 GPUs. 


'https://github.com/aimes-project/Shallow WaterEquations. 
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Fig. 5 Impact of blocking: Performance measurements with variable grid width. (a) Broadwell 
CPU. (b) P100 GPU 


e NEC test system 


Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30 GHz Broadwell processors with 
Aurora vector engines. 


4.3 Evaluating Blocking 


To explore the benefit of organizing memory accesses efficiently across architec- 
tures, experiments using the shallow water equation solver code were conducted 
on Piz Daint. First, we generated code versions with and without blocking for the 
Broadwell processor and the P100 GPU. An excerpt of results presented for different 
size of grids is shown in [29] and in Fig. 5a. The experiment was done with grid 
width of 200K. In the paper, we investigated the influence of the blocking factor on 
the application kernel further revealing that modest block sizes are leading to best 
performance (256 to 10k for CPU and 2—10k for GPU). On both architectures, 
wider grids run less efficiently as a result of an inefficient use of caches. The 
GPU implies additional overhead for the blocked version, requiring to run with a 
sufficiently large grid to benefit from it. This also shows the need to dynamically 
turn on/off blocking factors depending on the system capabilities. 


4.4 Evaluating Vectorization and Memory Layout Optimization 


As part of [28], we evaluated techniques to apply memory layout transformations. 
The experiments were done on two architectures, the Broadwell multi-core proces- 
sors and the Aurora vector engine on the NEC test platform using the shallow water 
equation solver code. 
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Table 2 Performance of memory layout variants on CPU and the NEC vector architec- 


ture 
. Short f 
Architecture Scattered j Contiguous 
distance 
Broadwell 3 GFlops 13 GFlops 25 GFlops 
NEC Aurora 80 GFlops 161 GFlops 322 GFlops 
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We used alternative layouts with different distances between data elements to 
investigate the impact of the data layout on the performance. The explored data 
alternatives were data accesses to 


e contiguous unit stride arrays, 

e interleaved data with constant short distance separating data elements, 4 bytes 
separating consecutive elements. This allowed to emulate array of structures 
(AoS) layouts, 

e scattered data elements separated with long distances. 


The results are listed in brief in Table 2. Using memory profilers, we found that 
the contiguous unit-stride code allowed to use the memory throughput efficiently 
on both architectures. In the emulated AoS codes, the efficiency dropped on both 
architectures. The worst measurements were taken for the scattered data elements. 

Besides the impact of the optimization on the use of the memory bandwidth, 
vectorization is also affected by those alternatives. AVX2 was used for all kernels 
on Broadwell for the unit-stride code. Similarly, the vector units of the vector engine 
showed best results with this layout. Again the use of the vectorization was degraded 
with the emulated AoS, and was even worse with the scattered data elements. 


4.5 Evaluating Inter-Kernel Optimization 


To explore the benefit of kernel merging, experiments were done using the shallow 
water equation solver code on the NEC test system (for vector engines) and Piz 
Daint (for GPUs and CPUs) [29]. The performance of regular and merged kernels 
are shown in Table 3. The code achieves near optimal use of the memory bandwidth 
already before the merge and actually decreases slightly after the merge. Exploiting 
the inter-kernel cache reuse allowed to reduce the data access in memory and 
increased the total performance of the application by 30-50%. 


4.6 Scaling with Multiple-Node Runs 


To demonstrate the ability of the DSL to support global icosahedral models, we 
carried out experiments using the two applications. Scalability experiments of 
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Table 3 Performance and efficiency of the kernel fusioning on all three architectures 


Before merge With Inter-Kernel Merging 
Theoretical Measured Measured 
f . Memory memory p memory 
Sacer bandwidth throughput GElops throughput GEiope 
(GB/s) (GB/s) and peak (GB/s) and peak 
Broadwell 77 62 (80%) | 24 | ~~ 60 (78%) 
P100 GPU 500 380 (76%) 389 (78%) 
NEC Aurora 1,200 961 (80%) | 322 {911 (76%) | 
scaling Architecture 
on ~® strong 90004 ~*~ Broadwell 
ao oe ~~ 
g 100 weak a P100 
- o 
F & 6000 
S i 
£ o] 
= 504 
2 30004 R 
5 z 
à 
oL : ; t opre iz i l 
0 10 20 30 40 50 0 25 50 75 100 
MPI Processes Nodes 
(a) (b) 


Fig. 6 Scalability of the two different models (measured on Mistral; P100 on Piz Daint). (a) 
Icosahedral code [30]. (b) Shallow water solver [31] 


unstructured grid code were run on Mistral. Shallow water equation solver code 
experiments were run on Mistral for CPU tests, and on Piz Daint for GPU tests. 

In the experiment using the unstructured grid, we use the global grid of the 
application and apply a three-dimensional Laplacian stencil. We varied the number 
of nodes that we use to run the code up to 48 nodes. The minimum number of the 
grid points we used is 1,048,576. We used this number of points for the strong-scale 
analysis. Weak scalability experiments were based on this number of points for each 
node. Figure 6a shows the results. 

We could do further numbers of nodes, however, we found that the code was 
scaling with the tested cases and further experiments needed resources and time to 
get jobs to run on the test machine. For the measured cases, the weak scalability of 
the code is close to optimal. Thus, increasing the resolution of the grids and running 
the code on more nodes is achieved efficiently. This is an important point as higher 
resolution grids are essential for recent and future global simulations. 

We also carried out experiments to scale the shallow water equation solver on 
both Broadwell multi-core processors and on the P100 GPUs at Piz Daint. We 
generated code for Broadwell experiments with OpenMP as well as MPI, and for 
the GPUs with OpenACC as well as MPI. Figure 6b shows the measured results of 
scaling the code. On the Broadwell processor, we used 36 OpenMP threads on each 
node. 


86 J. Kunkel et al. 


While the performance of the GPU code scaled well, it loses quite some 
efficiency when running on two processes as the halo communication involves the 
host code. In general, the code that the tools generate for multiple nodes shows to 
be scalable, both on CPUs and GPUs. The DSL code does not include any details 
regarding single or multiple node configuration, so users do not need to care about 
multiple node parallelization. However, the parallelization can still be applied by 
the tool and the users can still control the parallelization process. 


4.7 Exploring the Use of Alternative DSLs: GridTools 


The GridTools framework is a set of libraries and utilities to develop performance- 
portable weather and climate applications. It is developed at The Swiss National 
Supercomputing Center [13]. To achieve the goal of performance portability, the 
user code is written in a generic form which is then optimized for a given archi- 
tecture at compile time. The core of GridTools is the stencil composition module 
which implements a DSL embedded in C++ for stencils and stencil-like patterns. 
Further, GridTools provides modules for halo exchanges, boundary conditions, data 
management and bindings to C and Fortran. GridTools is successfully used to 
accelerate the dynamical core of the COSMO model with improved performance on 
CUDA-GPUs compared to the current official version, demonstrating production 
quality and feature-completeness of the library for models on lat-lon grids [48]. 
Although GridTools was developed for weather and climate applications it might be 
applicable for other domains with a focus on stencil-like computations. 

In the context of the AIMES project, we evaluated the viability of using 
GridTools for the dynamical core of NICAM: namely NICAM-DC. Since NICAM- 
DC is written in Fortan, we first had to port the code to C++, which includes 
also changing the build systems. Figures 9 and 10 show simple example codes 
extracted from NICAM-DC and ported to GridTools, respectively. We ported the 
dynamical core using the following incremental approach. First, each operator was 
ported individually to GridTools, i.e. re-written from Fortran to C++. Next, we used 
a verification tool to assure that the same input to the C++ and Fortran version gives 
the same output. Next we move on to the following operator. Table 4 shows results 
from benchmarks extracted from NICAM-DC. It provides good speedup on GPU, 
and speed on CPU (in OpenMP) comparable to the hand-written version. Figure 7 
shows results for running all operators of NICAM-DC on 10 nodes. It is worth 
mentioning that the most time consuming operator is more than 7x faster on GPU 
versus CPU. 

GridTools demonstrates good GPU performance and acceptable CPU perfor- 
mance. The functionalities and features included in GridTools were enough to 
support the regular mesh code of NICAM-DC without friction (i.e. no custom 
features were required in GridTools to support the requirements of NICAM-DC). 
In addition, GridTools are transparent in the sense that no information about the 
platform is exposed to the end-user. On the other hand, following is a list of issues 
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Table 4 Execution time (seconds) of different benchmarks extracted from NICAM-DC. This 
includes the regular regions only, using 1 region, a single MPI rank for a 130 x 130 x 42 grid 


Benchmark CUDA (Nvidia K40) OpenMP (Broadwell-EP CPU E5-2630 v4 @2.20GHz) 


Manual_opt OMP_THREADS=1 | OMP_THREADS=5 | OMP_THREADS=10 
(coal, sh_mem, 
occu, reg_pres) Manual | GridTools ian GridTools D e 


Diffusion 0.575 0.61 19.2 19.4 4.3 2.2 3.01 
(5.23x (4.93) | (1.0x) | (1.0x) | (4.46x) (8.72x) | (3.22x) 
OMP=10) 


Divdamp 35.23 3.15 3.17 74.3 743 15.4 30.08 77 15.22 

(erid40_so0m_2akm) (4.83x (4.80x) | (1.0x) (1.0x) | (4.84x) | (2.47x) | (4.78x) | (2.44x) 
OMP=10) 

Vi_rhow_Solver | 0.91 .288 0.311 12.0 12.1 3.18 1.23 1.77 

(varid40_600m_24ian) (6.14x (5.69x) | (1.0x) | (1.0x) (3.8x) (4.87) | (3.4x) 

OMP=10) 


Runtime for an entire run (Seconds) 


OPRT_diffusion P'S), 


OPRT_divdamp  fume”o28368 


OPRT_laplacian kbang e 


i 1.929230769 

OPRT_gradient 1254 

$ 2.246588236 
OPRT3D_divdamp Diiis 3036 


OPRT_horizontalize_vec Jie... 
OPRT divergence e sss 389 


0 20 40 60 80 100 120 140 160 


m Tesla® P100 m Xeon® E5-2695 v4 @ 2.10GHz (Broadwell 18 cores) 


Fig. 7 NICAM-DC operators. Running on 10 nodes with one MPI rank per node. P100 is running 
GridTools generated kernels and Xeon uses the original Fortran code. Total grid is 32 x 32 x 10 
using 40 vertical layers 


that requires one’s consideration when using GridTools: first, the requirement of 
rewriting the entire dynamical core in C++ is not a trivial task, especially since C++ 
templates make the code more convoluted, in comparison to Fortran. Second, while 
GridTools as a stencil framework does a good job for the dynamical core, separate 
solutions are required for the physics modules, the communicator module, and the 
non-regular compute modules (e.g. polar regions). Using different solutions inside 
the same code base typically increases the friction between code modules. Third, 
the interfacing between Fortran and C++ is non-trivial and can be troublesome 
considering that not all end-users are willing to change their build process. 
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Listing 9 Example of the diffusion operator extracted from NICAM-DC 


XDIR, ZDIR 
jmin-1, jmax 
= imin-1, imax 
veti JE = 


& 
& 
) 
& 
& 
) 


& 
& 


) 


Listing 10 Example of the diffusion operator ported to GridTools 


template <typename evaluation> 
GT_FUNCTION 


static void Do(evaluation const & eval, x_ 


- 1.0 * eval (coef {a+1} 

- 1.0 * eval (coef {a+2} 
Soa 
1.0 * eval (coef {}) 
2.0 * eval (coef{a+1} 
1.0 * eval (coef {a+2} 
<> i+1}) 
1.0 * eval (coef {}) 
1.0 * eval (coef {a+1} 
2.0 * eval (coef {a+2} 
<> j+1}) 

l Ff BaOp 


1 
2 
3 
4 eval(vt{}) = (( + 2.0 * eval(coef{}) 
5 
6 


5 Massive I/O and Compression 


5.1 The Scientific Compression Library (SCIL) 


ce SE (al, 3) 56) & 


* scl (i+1,j,d) 


* scl (i,j+1,d) 


interval) { 


) x eval(scl 


) * eval (scl { 


) * eval (scl{ 


The developed compression library SCIL [36] provides a framework to compress 
structured and unstructured data using the best available (lossless or lossy) compres- 
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SCIL FRAMEWORK 


ALGORITHMS 


Application SCIL Tools ZFP 


quantities 
and data 


NetCDF4 abstol 


+ Quantities support 


O direct use 


reltol 
quantities SSS 


quantities, 
and data 


and data 


allquant 
HDF5 : 

a SCIL [ wm | 

+ Quantities support | 
+ SCIL Filter a a 

' delta 
compressed, compressed ————————— 
data T data 
m 


HDFS5-File 


Fig. 8 SCIL compression path and components (extended)[36] 


sion algorithms according to the definition of tolerable loss of accuracy and required 
performance. SCIL acts as a meta-compressor providing various backends such as 
algorithms like LZ4, ZFP, SZ but also integrates some alternative algorithms. 

The data path of SCIL is illustrated in Fig. 8. An application can either use 
NetCDF4 [45], HDF5 [19] or directly the C-interface of SCIL. Based on the 
defined quantities, the values and the characteristics of the data to compress, the 
appropriate compression algorithm is chosen. SCIL also comes with a library to 
generate various synthetic test patterns for compression studies, i.e., well-defined 
multi-dimensional data patterns of any size. Further tools are provided to plot, to 
add noise or to compress CSV and NetCDF3 files. 


5.2 Supported Quantities 


There are three types of quantities supported: 


Accuracy Quantities define the tolerable error on lossy compression. When 
compressing the value v to d it bounds the residual error (r = v — 0): 


e absolute tolerance: v — abstol < Ô < v + abstol 
e relative tolerance: v/(1 + reltol) < Ò < v- (1 + reltol) 


2HDF5 and NetCDF4 are APIs and self-describing data formats for storing multi-dimensional data 
with user-relevant metadata. 
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e relative error finest tolerance: used together with relative tolerance; absolute 
tolerable error for small v’s. If relfinest > |v - (1 + reltol)|, then v — relfinest < 
Ù < v + relfinest 

¢ significant digits: number of significant decimal digits 

e significant bits: number of significant digits in bits 


SCIL must ensure that all the set accuracy quantities are honored regardless of 
the algorithm chosen, meaning that one can set, e.g., absolute and relative tolerance 
and the strictest of the criteria is satisfied. 


Performance Quantities These quantities define the expected performance behav- 
ior for both compression and decompression (on the same system). The value can 
be defined according to: (1) absolute throughput in MiB or GiB; or (2) relative to 
network or storage speed. It is considered to be the expected performance for SCIL 
but it may not be as strictly handled as the qualities—there may be some cases in 
which performance is lower. Thus, SCIL must estimate the compression rates for 
the data. 


Supplementary Quantities An orthogonal quantity that can be set is the so called 
fill value, a value that scientists use to mark special data points. This value must be 
preserved accurately and usually is a specific high or low value that may disturb a 
smooth compression algorithm. 


5.3 Compression Chain 


Internally, SCIL creates a compression chain which can involve several compression 
algorithms as illustrated in Fig. 9. Based on the basic datatype that is supplied, the 
initial stage of the chain is entered. Algorithms may be preconditioners to optimize 
data layout for subsequent compression algorithms, converters from one data format 
to another, or, on the final stage, a lossless compressor. Floating-point data can be 
first mapped to integer data and then to a byte stream. Intermediate steps can be 
skipped. 


process data process data 


Array of = Array of Pune 
i Type-To-Integer Type-To-Byte 
Type Teslype Converter integer To Intese Compressor 
Preconditioners Preconditioners 


data float float int any any 


Byte-To-Byte 
Compressor 
compr. 


data 


Fig. 9 SCIL compression chain. The data path depends on the input data type [36] 
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5.4 Algorithms 


SCIL comes with additional algorithms that are derived to support one or multiple 
accuracy quantities set by the user. For example, the algorithms Abstol (for absolute 
tolerance), Sigbits (for significant bits/digits), and Allquant. These algorithms aim to 
pack the number of required bits as tightly as possible into the data buffer but operate 
on each value independently. While Abstol and Sigbits just consider one quantity, 
Allquant considers all quantities together and chooses the required operation for 
a data point depending on the highest precision needed. We also consider these 
algorithms to be useful baselines when comparing any other algorithm. ZFP and 
SZ, for example, work on one quantity, too. 

During the project, we explored the implementation for the automatic algorithm 
selection but only integrated a trivial scheme for the following reasons: if only a 
single quantity is set, we found out that the optimal parameter depends on many 
factors (features); the resulting optimal choice is embedded in a multi-dimensional 
space—this made it infeasible to identify the optimal algorithm. Once more than a 
single quantity is set, only one of the newly integrated algorithms can perform the 
compression, which eliminates any choice. As the decoupling of SCIL enables to 
integrate algorithms in the future, we hope that more algorithms will be developed 
that can then benefit from implementing the automatic selection. 


6 Evaluation of the Compression Library SCIL 


We evaluated the performance and compression ratio of SCIL against several 
scientific data sets and the synthetic test patterns generated by SCIL itself [36]. 


6.1 Single Core Performance 


In the following, an excerpt of the experiments conducted with SCIL on a single core 
is shown. These results help to understand the performance behavior of compression 
algorithms and their general characteristics. 

Four data sets were used each with precision floating-point data (32 bit): (1) the 
data created with the SCIL pattern library (10 data sets each with different random 
seed numbers). The synthetic data has the dimensionality of (300 x 300 x 100 = 
36 MB); (2) the output of the ECHAM atmospheric model [46] which stored 123 
different scientific variables for a single timestep as NetCDF; (3) the output of the 
hurricane Isabel model which stored 633 variables for a single timestep as binary;* 
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Table 5 Harmonic mean 
compression performance for 
different scientific data sets 


J. Kunkel et al. 


Compr. | Decomp. 
Algorithm Ratio | MiB/s | MiB/s 


(a) 1% absolute tolerance 


NICAM 59 
458 
m 
302 
ECHAM 260 
196 
J 
iss 
Isabel 352 
79 
70 
739 
Random 365 
356 
s; 
145 


(b) 9 bits precision 


503 
456 
400 
169 
301 
403 
356 
187 
428 
382 
382 
125 
241 


NICAM [sights [0439 [257 [an 


182 
126 
ECHAM 462 
221 
155 
Isabel 301 
i97 
133 


341 
182 
615 


Random [sigits [0346 |358 [511 


sigbits,1z4 0.348 | 346 
0.252 |151 


(4) the output of the NICAM Icosahedral Global Atmospheric model which stored 


83 variables as NetCDF. 


The characteristics of the scientific data varies and so does data locality within 
the data sets. For example, in the Isabel data many variables are between 0 and 0.02, 
many between —80 and +80 and some are between —5000 and 3000. 

We set only one quantity to allow using ZFP and SZ for comparison. Table 5 
shows the harmonic mean compression ratio* for setting an absolute error of 1% of 


3http://vis.computer.org/vis2004contest/data.html. 


‘The ratio is the resulting file size divided by the original file size. 
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Fig. 10 Compressing various climate variables with absolute tolerance 1% 


the maximum value or setting precision to 9 bit accuracy. The harmonic mean cor- 
responds to the total reduction and performance when compressing/decompressing 
all the data. 

The results for compressing 11 variables of the whole NICAM model via the 
NetCDF API are shown in Fig. 10. The x-axis represents the different data files, as 
each file consists of several chunks, a point in the y-axis represents one chunk. It 
can be observed that generally the SZ algorithm yields the best compression ratio 
but Abstol+LZ4 yields the second best ratio providing much better and predictable 
compression and decompression speeds. 

Note that for some individual variables, one algorithm may supersede another in 
terms of ratio. As expected there are cases in which one algorithm is outperforming 
the other algorithms in terms of compression ratio which justifies the need for a 
metacompressor like SCIL that can make smart choices on behalf of the users. Some 
algorithms perform generally better than others in terms of performance. Since in 
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our use cases, users define the tolerable error, we did not investigate metrics that 
compare the precision for a fixed compression ratio (e.g., the signal to noise ratio). 

While a performance of 200 MB/s may look insufficient for a single core, with 
24 cores per node a good speed per node can be achieved that is still beneficial 
for medium large runs on shared storage. For instance, consider a storage system 
that can achieve 500GB/s. Considering that one node with typical Infiniband 
configuration can transfer at least 5GB/s, 100 client nodes saturate the storage. 
By compressing 5:1 (or ratio of 0.2), virtually, the storage could achieve a peak 
performance of 2500GB/s, and, thus, can serve up to 500 client nodes with 
(theoretical) maximum performance. 

Trading of storage capacity vs. space is an elementary issue to optimize bigger 
workflows. By separating the concerns between the necessary data quality as defined 
by scientists and compression library, site-specific policies could be employed that 
depend also on the available hardware. 


6.2 Compression in HDFS5 and NetCDF 


We tested the compression library SCIL using the icosahedral grid code. The code 
can use NetCDF to output the generated data periodically. In this experiment, a 
high-resolution grid with 268 million grid cells (single precision floating point) in 
the horizontal times 64 vertical levels was used and executed on the supercomputer 
Mistral. The code was run on 128 nodes, with one MPI process per node. It wrote 
one field to the output file in one timestep. The field values range between —10 to 
+55 which is important for understanding the impact of the compression. 

The experiments varied the basic compression algorithms and parameters pro- 
vided by SCIL. Compression is done with the algorithms 


memcopy: does not do any real compression, but allows to measure the overhead 
of the usage of enabling HDF5 compression and SCIL. 

1z4: the well-known compression algorithm. It is unable to compress floating- 
point data but slows down the execution. 

abstol,lz4: processes data elements based on the absolute value of each point, we 
control the tolerance by the parameter absolute_tolerance, after quantization an 
LZ4 compression step is added. 

sigbits,lz4: processes data elements based on a percentage of the value being pro- 
cessed, we control the tolerance by the parameter relative_tolerance_percent, 
after quantization an LZ4 compression step is added. 


The results of this selection of compression methods is shown in Table 6, it 
shows the write time in seconds and resulting data size in GB, a virtual throughput 
relative to the uncompressed file size, and the speedup. Without compression the 
performance is quite poor: achieving only 432 MB/s on Mistral on 128 nodes, 
while an optimal benchmark can achieve 100GB/s. The HDF5 compression is 
not yet optimally parallelized and requires certain collective operations to update 
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Table 6 Compression results of 128 processes on Mistral 


Write time | Data size | Throughput* 


Compression method | Parameter ins in GB in MB/s Speedup 

No-compression 165.3 71.4 432 1.0 

memcopy 570.1 71.4 125 0.3 

124 795.3 71.9 90 0.2 

abstol,1z4 absolute_tolerance=1 12.8 DT: 5578 12.9 
absolute_tolerance=.1 72.6 2.8 983 23 

sigbits,1z4 relative_tolerance_percent=1 12.9 2.9 5535 12.8 
relative_tolerance_percent=.1| 18.3 3.2 3902 9.0 


the metadata. Internally, HDF5 requires additional data buffers. This leads to extra 
overhead in the compression slowing down the I/O (see the memcopy and and LZ4 
results which serve as baselines). By activating lossy compression and accepting an 
accuracy of 1% or 0.1%, the performance can be improved in this example up to 
13x. 

Remember that these results serve as feasibility study. One of our goals was 
to provide a NetCDF backwards compatible compression method not to optimize 
the compression data path inside HDF5. The SCIL library can be used directly by 
existing models avoiding the overhead and leading to the results as shown above. 


7 Standardized Icosahedral Benchmarks 


Our research on DSL and I/O is more practical. We started with real-world appli- 
cations, namely three global atmospheric models developed by the participating 
countries. The global atmospheric model with the icosahedral grid system is one 
of the new generation global climate/weather models. The grid-point calculations, 
which are less computational intense than the spherical harmonics transformation, 
are used in the model. On the other hand, patterns of the data access in differential 
operators are more complicated than the traditional limited-area atmospheric model 
with the Cartesian grid system. 

There are different implementations of the dynamical core on the icosahedral 
grid: direct vs. indirect memory access, staggered vs. co-located data distribution, 
and so on. The objective of standardization of benchmarks is to provide a variety 
of computational patterns of the icosahedral atmospheric models. The kernels are 
useful for evaluating the performance of new machines and new programming 
models. Not only for our studies but also for the existing/future DSL studies, this 
benchmark set provides various samples of the source code. The icosahedral grid 
system is an unstructured grid coordinate, and there are a lot of challenging issues 
about data decomposition, data layout, loop structures, cache blocking, threading, 
offloading the accelerators, and so on. By applying DSLs or frameworks to the 
kernels, the developers can try the detailed, practical evaluation of their software. 


96 J. Kunkel et al. 


Kernel suite Variants 


OpenMP ver. 
Input J Stencil kernels 
| input | Ave 
| input | Stencil kernels 
Stencil kernels 
Reference output ; Stencil kernels 


[Documentation | 


Fig. 11 Overview of IcoAtmosBenchmark v1 


The benchmarks were used in the final evaluation of SCIL and they steered the DSL 
development by providing the relevant patterns. 


7.1 IcoAtmosBenchmark v1: Kernels from Icosahedral 
Atmospheric Models 


IcoAtmosBenchmark v1 is the package of kernels extracted from three Icosahedral 
Global Atmospheric models, NICAM, ICON, DYNAMICO. As shown in Fig. 11, 
we prepared input data and reference output data for easy validation of results. We 
also arranged documentation about the kernels. The package is available online." 


Shttps://aimes-project.github.io/IcoAtmosBenchmark_v1/. 
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7.1.1 Documentation 


An excerpt to the NICAM kernel serves as an example. The icosahedral grid on the 
sphere of NICAM is the unstructured grid system. In NICAM code, the complex 
topology of the grid system is separated into the structured and unstructured part. 
The grids are decomposed into tiles, and one or more tiles are allocated to each 
MPI process. The horizontal grids in the tile are kept in a 2-dimensional structure. 
On the other hand, a topology of the network of the tiles is complex. We selected 
and extracted 6+1 computational kernels from the dynamical core of NICAM, as 
samples of the stencil calculation on the structured grid system. We also extracted a 
communication kernel, as a sample of halo exchange in the complex node topology. 
The features of each kernel are documented on the GitHub page. 

All kernels are single subroutines and almost the same as the source codes in 
the original model, except for the ICON kernels. They are put into the wrapper 
for the kernel program. Values of input variables in the argument list of the kernel 
subroutine are stored as a data file, just before the call in the execution of the 
original model. They are read and given to the subroutine in the kernel program. 
Similarly, the values of output variables in the argument list are stored, just after 
the call in execution. They are read and compared to the actual output values of 
kernel subroutine. The differences are written to the standard output for validation. 
For easy handling of the input/reference data by both the Fortran program and C 
program, we prepared an I/O routine written in C. 

We provided a user manual, which contains the brief introduction of each model, 
the description of each kernel, usage of kernel programs, and sample results. This 
information is helpful for users of this kernel suite in the future. 


8 Summary and Conclusion 


The numerical simulation of climate and weather is demanding for computational 
and I/O resources. Within the AIMES project, we addressed those challenges and 
researched approaches that foster the separation of concerns. This idea unites 
our approaches for the DSL and the compression library. While a higher level 
of abstraction can improve the productivity for scientists, most importantly the 
decoupling of requirements from the implementation allows scientific programmers 
to develop and improve architecture-specific optimizations. 

Promising candidates for DSLs have been explored and with GGDML an alter- 
native has been developed that covers the most relevant formulations of the three 
icosahedral models: DYNAMICO, ICON, and NICAM. The DSL and toolchain we 
developed integrates into existing code bases and suits for incremental reformulation 
of the code. We estimated the benefit for code reduction and demonstrated several 
optimizations for CPU, GPU, and vector architectures. Our DSL allows to reduce 
code to around 30% of the LOC in comparison to code written with GPL code. 
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With the semantics of GGDML, we could achieve near optimal use of memory 
hierarchies and memory throughput which is critical for the family of computations 
in hand. Our experiments show running codes with around 80% of achievable 
memory throughput on different archiectures. Furthermore, we could scale models 
to multiple nodes, which is essential for exascale computing, using the same code 
that is used for a single node. The separation of concerns in our approach allowed us 
to keep models code separate of underlying hardware changes. The single GGDML 
source code is used to generate code for the different architectures and on single vs. 
multiple nodes. 

To address the I/O challenge, we developed the library SCIL, a metacompressor 
supporting various lossy and lossless algorithms. It allows users to specify various 
quantities for the tolerable error and expected performance, and allows the library 
to chose a suitable algorithm. SCIL is a stand-alone library but also integrates 
into NetCDF and HDF5 allowing users to explore the benefits of using alternative 
compression algorithms with their existing middleware. We evaluated the perfor- 
mance and compression ratio for various scientific data sets and on moderate scale. 
The results show that the choice of the best algorithm depends on the data and 
performance expectation which, in turn, motivates the need for the decoupling of 
quantities from the selection of the algorithm. A blocker for applying parallel large- 
scale compression in existing NetCDF workflows is the performance limitation of 
the current HDF5 stack. 

Finally, benchmarks and mini-applications were created that represent the key 
features of the icosahedral applications. 

Beside the achieved research in the AIMES project, the work done opens the door 
for further research in the software engineering of climate and weather prediction 
models. The performance portability, where we used the same code to run on 
different architectures and machines, including single and multiple nodes, shows 
that techniques are viable to continue the research in this direction. The code 
reduction offered by DSLs promises to save millions in development cost that 
can be used to contribute to the DSL development. During the runtime of the 
project, it became apparent that the concurrently developed solutions GridTools and 
PSYclone that also provide such features are preferred by most scientists as they are 
developed by a bigger community and supported by national centers. We still believe 
that the developed light-weight DSL solution provides more flexibility particularly 
for smaller-sized models and can be maintained as part of the development of 
the models itself. It also can be used in other contexts providing domain-specific 
templates to arbitrary codes. 

In the future, we aim to extend the DSL semantics to also address I/O relevant 
specifications. This would allow to unify the effort towards optimal storage and 
computation. 
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Abstract DASH is a new programming approach offering distributed data struc- 
tures and parallel algorithms in the form of a C++ template library. This article 
describes recent developments in the context of DASH concerning the ability to exe- 
cute tasks with remote dependencies, the exploitation of dynamic hardware locality, 
smart data structures, and advanced algorithms. We also present a performance 
and productivity study where we compare DASH with a set of established parallel 
programming models. 


1 Introduction 


DASH is a parallel programming approach that realizes the PGAS (partitioned 
global address space) model and is implemented as a C++ template library. 
DASH tries to reconcile the productivity advantages of shared memory parallel 
programming with the physical realities of distributed memory hardware. To achieve 
this goal, DASH provides the abstraction of globally accessible memory that 
spans multiple interconnected nodes. For performance reasons, this global memory 
is partitioned and data locality is not hidden but explicitly exploitable by the 
application developer. 

DASH is realized as a C++ template library, obviating the need for a custom 
language and compiler. By relying on modern C++ abstraction and implementation 
techniques, a productive programming environment can be built solely based on 
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standard components. For many application developers in HPC and in general, a big 
part of the appeal of the C++ programming language stems from the availability 
of high performance generic data structures and algorithms in the C++ standard 
template library (STL). 

DASH can be seen as a generalization of concepts found in the STL to the 
distributed memory case and efforts have been made to keep DASH compatible 
with components of the STL. In many cases it is thus possible to mix and match 
algorithms and data structures freely between DASH and the STL. 

DASH is developed in the context of the SPPEXA priority programme for 
Exascale computing since 2013. In this paper we give an overview of DASH and 
report on activities within the project focusing on the second half of the funding 
period. We first give an overview of the DASH Runtime System (DART) in Sect. 2, 
focusing on features related to task execution with global dependencies and dynamic 
hardware topology discovery. In Sect. 3 we describe two components of the DASH 
C++ template library, a smart data structure that offers support for productive 
development of stencil codes and an efficient implementation of parallel sorting. 
In Sect. 4 we provide an evaluation of DASH testing the feasibility of our approach. 
We provide an outlook on future developments in Sect. 5. 


2 The DASH Runtime System 


The DASH Runtime System (DART) is implemented in C and provides an abstrac- 
tion layer on top of distributed computing hardware and one-sided communication 
substrates. The main functionality provided by DART is memory allocation and 
addressing as well as communication in a global address space. In DASH parlance 
the individual participants in an application are called units mapped to MPI 
processes in the MPI-3 remote memory access based implementation of DART. 

Early versions of DASH/DART focused on data distribution and access and 
offered no explicit compute model. This has changed with the support for tasks 
in DASH and DART. We start with a discussion of these new features, followed by 
a description of efforts to tackle increasing hardware complexity in Sect. 2.2. 


2.1 Tasks with Global Dependencies 


The benefit of decoupled transfer and synchronization in the PGAS programming 
model promises to provide improved scalability and better exploit hardware capa- 
bilities. However, proper synchronization of local and global memory accesses is 
essential for the development of correct applications. So far, the synchronization 
constructs in DASH were limited to collective synchronization using barriers and 
reduction operations as well as an implementation of the MCS lock. Using atomic 
RMA operations, users could also create custom synchronization schemes using 
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point-to-point signaling, i.e., by setting a flag at the target after completion of a 
transfer. While this approach might work for simple examples, it hardly scales to 
more complex examples where reads and writes from multiple processes need to be 
synchronized. 

The need for a more fine-grained way of synchronization that allows to cre- 
ate more complex synchronization patterns was thus imminent. The data-centric 
programming model of DASH with the distributed data structure at its core lead 
motivated us to create a synchronization that centers around these global data 
structures, i.e., which is data-centric itself. At the same time, the essential property 
of PGAS needed to be preserved: the synchronization had to remain decoupled 
from data transfers, thus not forcing users to rely solely on the new synchronization 
mechanism for data transfers. 

At the same time, the rise of task-based programming models inspired us to 
investigate the use of tasks as a synchronization vehicle, i.e., by encapsulating local 
and global memory accesses into tasks that are synchronized using a data-centric 
approach. Examples of widely known data-centric task-based synchronization 
models are OpenMPI with its task data dependencies, OmpSs, and PaRSEC. 
While PaRSEC uses data dependencies to express both synchronization and actual 
data flow between tasks, OpenMP and OmpSs use data dependencies solely for 
synchronization without affecting data movement. In contrast to PaARSEC, however, 
OpenMP and OmpSs only support shared memory parallelization. 

A different approach has been taken by HPX, which facilitates synchronization 
through the use of future/promise pairs, which form a channel between two or 
more tasks and are a concept that has been well established in the C++ community. 
However, this synchronization concept with it’s inherent communication channel 
hardly fits into the concept of a PGAS abstraction built around data structures in the 
global memory space. Moreover, DASH provides a locality-aware programming, 
in which processes know their location in the global address and can diverge their 
control accordingly, whereas HPX is a locality-agnostic programming model. 

We thus decided to focus our research efforts on distributed data dependencies, 
extending the shared memory capabilities of task data dependencies into the global 
memory space while keeping synchronization and data transfer decoupled. 


2.1.1 Distributed Data Dependencies 


Tasks in DASH are created using the async function call and passing it an action 
that will be executed by a worker thread at a later point in time. Additionally, the 
async function accepts an arbitrary number of dependency definitions of the form 
in(memory_location) and out (memory_location) to define input and output 
dependencies, respectively. In the DASH tasking model, each unit discovers it’s 
local task graph by creating tasks operating mainly in the local portion of the 
global memory space, i.e., tasks are never transferred to other units. This locality- 
awareness limits the number of tasks to be discovered to only the tasks that will 
eventually be executed in that unit: as depicted in Fig. 1. 
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Fig. 1 Distributed task graph discovery: the unit in the middle only discovers its local tasks (green) 
and should only be concerned with tasks on other units that have dependencies to its local tasks 
(blue). All other tasks (gray) should not be considered 


Synchronization 
Task Creation : Task Matching Task Execution 


Fig. 2 Scheduler interaction required for handling remote task data dependencies 


The discovery of the local task graphs thus happens in parallel and without 
immediate synchronization between the units, i.e., only the bold edges in Fig. 1 are 
immediately visible to the individual scheduler instances. In order to connect these 
trimmed task graphs, the schedulers need to exchange information on dependencies 
crossing its process boundary, i.e., dependencies referencing non-local global mem- 
ory locations. The required interaction between the schedulers is depicted in Fig. 2. 
During the discovery of the trimmed local task graphs, schedulers communicate 
any encountered dependencies that reference non-local global memory to the 
unit owning the referenced memory (1). As soon as all dependencies have been 
communicated, the schedulers extend their local task graphs with the dependency 
information received from other units @). A synchronization across all units is 
required to ensure that all relevant dependency information has been exchanged. 

After the extension of the local task graphs, the units start the execution of the 
tasks. As soon as a task with a dependency to a task on a remote unit, e.g., through 
an input dependency previously communicated by the remote unit, has completed, 
the dependency release is communicated to the remote unit G), where the task will 
eventually be executed. The completion of the execution is then communicated 
back to the first scheduler (® to release any write-after-read dependency, e.g., the 
dependency Cz — Bz in Fig. 1. 
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2.1.2 Ordering of Dependencies 


As described in Sect.2.1.1, the local task graphs are discovered by each unit 
separately. Local edges in the local task graph are discovered similar to the matching 
rules of OpenMP, i.e., an input dependency refers to the previous output dependency 
referencing the same memory location (read-after-write), which in turn match with 
previous input and output dependencies on the same memory location (write-after- 
read and write-after-write). 

However, since the local task graphs are discovered in parallel, the sched- 
ulers cannot infer any partial ordering of tasks and dependencies across process 
boundaries. More specifically, the blue scheduler in Fig. 1 cannot determine the 
relationship between the dependencies of tasks B1, B2, C2, C4. The schedulers thus 
have to rely on additional information provided by the user in the form of phases (as 
depicted in Fig. 1). A task and its output dependencies are assigned to the current 
phase upon their discovery. Input dependencies always refer to the last matching 
output dependency in any previous phase while output dependencies match with any 
previous local input dependency in the same or earlier phase and any remote input 
dependency in any earlier phase, up to and including the previous output dependency 
on the same memory location. 

As an example, the input dependency of C2 is assigned the phase N + 1 whereas 
the input dependency of C4 is assigned the phase N + 3. This information can be 
used to match the output dependency of Bı in phase N to the input dependency of 
C2 and the output dependency of Bz in phase N + 2 to the input dependency of 
C4, creating the edges Bı — Cz and B2 — C4. The handling of write-after-read 
dependencies described in Sect. 2.1.1 creates the edge C2 — B2. The handling of 
local dependencies happens independent of the phase. 

In our model, conflicting remote dependencies in the same phase are erroneous 
as the scheduler is unable to reliably match the dependencies. Two dependencies 
are conflicting if at least one of them is non-local and at least one is an output 
dependency. This restriction allows the schedulers to detect synchronization errors 
such as underdefined phases and report them to the user. This is in contrast to the 
collective synchronization through barriers traditionally used in DASH, in which 
synchronization errors cannot be easily detected and often go unnoticed unless the 
resulting non-deterministic behavior leads to deviations in the application’s results. 


2.1.3 Implementation 


A single task is created using the async function in DASH, which accepts both an 
action to be performed when the task is executed and a set of dependencies that 
describe the expected inputs and outputs of the task. In the example provided in 
Listing 1, every call to async (lines 5, 11, 19, and 27) is passed a C+ lambda in 
addition to input and output dependencies. 

Instead of pure input dependencies, the example uses copyin dependencies, 
which combine an input dependency with the transfer of the remote memory range 
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into a local buffer. This allows for both a more precise expression of the algorithms 
and allows the scheduler to map the transfer onto two-sided MPI communication 
primitives, which may be beneficial on systems that do not efficiently support MPI 
RMA. The action performed by the task could still access any global memory 
location, keeping communication and synchronization decoupled in principle and 
retaining a one-sided programming model while allowing the use of two-sided 
communication in the background. 


1 dash:: Matrix <2, double> matrix{N, N, dash::TILE(NB), dash:: TILE(NB) }; 


for (int k = 0; k < num_blocks; ++k) { 

4 if (mat. block(k,k).is_local()) { 
5 dash:: tasks::asynce([&](){ potrf(matrix.block(k,k)); }, 
6 dash:: tasks :: out(mat. block(k,k))); 

} 

dash:: tasks::async_fence(); // <— advance to next phase 
9 for (int i = k+l; i < num blocks: ++i) 
10 if (mat. block(k,i).is_local()) 
11 dash:: tasks :: async ([&]() { 
12 trsm(cache[k], matrix. block(k,i)); }, 
13 dash:: tasks :: copyin(mat.block(k,k), cache[k]), 
14 dash:: tasks :: out(mat.block(k,i))); 
15 dash:: tasks:: async fence // <— advance to next phase 
16 for (int i = k+l; i < num blocks: ++i) { 
l før Cint 1 s Etik J << ae Arr ff 
18 if (mat.block(j,i).is_local()) { 
19 dash:: tasks :: async ([&](){ 

ee 

20 gemm(cache[i], cache[j], mat.block(j,i)); }, 
21 dash:: tasks :: copyin(mat.block(k,i), cache[i]), 
22 dash:: tasks :: copyin(mat.block(k,j), cache[j]), 
dash:: tasks :: out(mat.block(j,i))); 


24 } 
25 } 
26 if (mat.block(i,i).is_local()) { 
2 dash:: tasks :: async ([&]() { 
28 syrk(cache[i], mat.block(i,i)); }, 
29 dash:: tasks :: copyin(mat.block(k,i), cache[i]), 
30 dash:: tasks :: out(mat.block(i,i))); 
31 } 
2 } 

dash:: tasks::async_fence(); // <— advance to next phase 
i 


35 dash::tasks::complete(); // <— wait for all tasks to execute 


Listing 1 Implementation of Blocked Cholesky Factorization using global task data dependencies 
in DASH. Some optimizations omitted for clarity 


The specification of phases is done through calls to the async_fence function 
(lines 8, 15, and 33 in Listing 1). Similar to a barrier, it is the user’s responsibility 
to ensure that all units advance phases in lock-step. However, the phase transition 
triggered by async_fence does not incur any communication. Instead, the call 
causes an increment of the phase counter, whose new value will be assigned to 
all ensuing tasks. 

Eventually, the application waits for the completion of the execution of the global 
task graph in the call to complete (). Due to the required internal synchronization 
and the matching of remote task dependencies, the execution of all but the tasks in 
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the first phase has to be post-poned until all its dependencies in the global task-graph 
are known. DASH, however, provides the option to trigger intermediate matching 
steps triggered by a phase increment and allows the specification of a an upper 
bound on the number of active phases! to avoid the need for discovering the full 
local task graph before execution starts. This way the worker threads executing 
threads may be kept busy while the main thread continues discovering the next 
window in the task graph. 

In addition to the single task creation construct described above, DASH also 
provides the taskloop() construct, for which an example is provided in Listing 2. 
The taskloop function divides the iteration space [begin, end) into chunks that 
are assigned to tasks, which perform the provided action on the assigned subrange 
(lines 7-9). The user may control the size of each chunk (or the overall number of 
chunks to be created) by passing an instance of chunk_size (or num_chunks) to the 
call (Line 5). In addition, the call accepts a second lambda that is used to specify the 
dependencies of each task assigned a chunk (lines 11-14), which allows a depth- 
first scheduler to chain the execution of chunks of multiple data-dependent loops, 
effectively improving cache locality without changing the structure of the loops. 


1 dash: :Array<int> arr (N); 


s if (dash::myid() == 0) { 
dash: :tasks: :taskloop ( 
arr begin} p arr endl): dash: tasks: chunkss zeT o) 


6 // task action 

[&] (auto begin, auto end) { 
8 // perform action on elements in [begin, end) 
9 }, 


10 // generate out dependencies on elements in [begin, end) 
il [&] (auto begin, auto end, auto deps) { 

12 iene lewe Sig = leystepliae slic Wes Eme erie 

13 *deps = dash::tasks::out (it); 
1 


Me 


Listing 2. Example of using the dash: : taskloop in combination with a dependency generator 


2.1.4 Results: Blocked Cholesky Factorization 


The implementation of the Blocked Cholesky Factorization discussed in Sect. 2.1.3 
has been compared against two implementations in PaRSEC. The first implementa- 
tion uses the parameterized task graph (PTG), in which the problem is described as 
an directed acyclic graph in a domain-specific language called JDF. In this version, 


'A phase is considered active while a task discovered in that phase has not completed execution. 
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the task graph is not dynamically discovered but is instead inherently contained 
within the resulting binary. 

The second PaRSEC version uses the Dynamic Task Discovery (DTD) interface 
of PaRSEC, in which problems are expressed with a global view, i.e., all processes 
discover the global task graph to discover the dependencies to tasks executing on 
remote processes. 

For all runs, a background communication thread has been employed, each time 
running on a dedicated core, leading to one main thread and 22 worker threads 
executing the application tasks on the Cray XC40. 

The results presented in Fig. 3 indicate that PARSEC PTG outperforms both DTD 
and DASH, due to the missing discovery of tasks and their dependencies. DTD 
exhibits a drop in per-node performance above 64 nodes, which may be explained 
with the global task graph discovery. Although the per-node performance of DASH 
does not exhibit perfect scaling, it still achieves about 80% of the performance of 
PaRSEC PTG at 144 nodes. 


2.1.5 Related Work 


HPX [18] is an implementation of the ParalleX [17] programming paradigm, in 
which tasks are spawned dynamically and moved to the data, instead of the data 
being moved to where the task is being executed. HPX is locality-agnostic in that 
distributed parallelism capabilities are implicit in the programming model, rather 
than explicitly exposed to the user. An Active Global Address Space (AGAS) is 
used to transparently manage the locality of global objects. Synchronization of tasks 
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is expressed using futures and continuations, which are also used to exchange data 
between tasks. 

In the Active Partitioned Global Address Space (APGAS) [27], in contrast, the 
locality of so-called places is explicitly exposed to the user, who is responsible for 
selecting the place at which a task is to be executed. Implementations of the APGAS 
model can be found in the X10 [9] and Chapel [8] languages as well as part of UPC 
and UPC+-+ [21]. 

The Charm+ programming system encapsulates computation in objects that 
can communicate using message objects and can be migrated between localities 
to achieve load balancing. 

Several approaches have been proposed to facilitate dynamic synchronization 
by waiting for events to occur. AsyncShmem is an extension of the OpenShmem 
standard, which allows dynamic synchronization of tasks across process boundaries 
by blocking tasks waiting for a state change in the global address space [14]. The 
concept of phasers has been introduced into the X10 language to implement non- 
blocking barrier-like synchronization, with the distinction of readers and writers 
contributing to the phaser [29]. 

Tasklets have recently been introduced to the XcalableMP programming 
model [35]. The synchronization is modeled to resemble message-based 
communication, using data dependencies for tasks on the same location and notify- 
wait with explicitly specified target and tags. 

Regent is a region- and task-based programming language that is compiled 
into C++ code using the Legion programming model to automatically partition the 
computation into logical regions [4, 30]. 

The PaRSEC programming system uses a domain specific language called JDF 
to express computational problems in the form of a parameterized task graph 
(PTG) [7]. The PTG is implicitly contained in the application and not discovered 
dynamically at runtime. In contrast to that, the dynamic task discovery (DTD) 
frontend of PaRSEC dynamically discovers the global task-graph, i.e., each process 
is aware of all nodes and edges in the graph. 

A similar approach is taken by the sequential task flow (STF) frontend of StarPU, 
which complements the explicit MPI send/recv tasks to encapsulate communication 
in tasks and implicitly express dependencies across process boundaries [1]. 

Several task-based parallelization models have been proposed for shared memory 
concurrency, including OpenMP [3, 24], Intel thread building blocks (TBB) [25] and 
Cilk+ [26] as well as SuperGlue [34]. With ClusterSs, an approach has been made 
to introduce the APGAS model into OmpSs [32]. 


2.2 Dynamic Hardware Topology 


Portable applications for heterogeneous hosts adapt communication schemes and 
virtual process topologies depending on system components and the algorithm 
scenario. This involves concepts of vertical and horizontal locality that are not based 
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Fig. 4 Strong scaling of matrix multiplication on single node for 4 to 32 cores with increasing 
matrix size N x N on Cori phase 1, Cray MPICH 


on latency and throughput as distance measure. For example in a typical accelerator- 
offloading scenario, data distribution to processes optimizes for horizontal locality 
to reduce communication distance between collaborating tasks. For communication 
in the reduction phase, distance is measured based on vertical locality. 

The goal to provide a domain-agnostic replacement for the C++ Standard 
Template Library (STL) implies portability as a crucial criterion for every model 
and implementation of the DASH library. This includes additional programming 
abstractions provided in DASH, such as n-dimensional containers which are 
commonly used in HPC. These are not part of the C++ standard specifications but 
comply with its concepts. Achieving portable efficiency of PGAS algorithms and 
containers that satisfy semantics of their conventional counterparts is a multivariate, 
hard problem, even for the seemingly most simple use cases. 

Performance evaluation of the of the DASH NArray and dense matrix-matrix 
multiplication abstractions on different system configurations substantiated the 
portable efficiency of DASH. The comparison also revealed drastic performance 
variance of the established solutions, for example node-local DGEMM of Intel 
MKL on Cori phase 1 shown in Fig. 4 which apparently expected a power of two 
amount of processing cores for multi-threaded scenarios. 

The DASH variant of DGEMM internally uses the identical Intel MKL distribu- 
tion for multiplication of partitioned matrix blocks but still achieves robust scaling. 
This is because DASH implements a custom, adaptive variant of the SUMMA 
algorithm for matrix-matrix multiplication and assigns one process per core, each 
using MKL in sequential mode. This finding motivated to find abstractions that 
allow expressions for domain decomposition and process placement depending on 
machine component topology. In this case: to group processes by NUMA domains 
with one process per physical core. 


2.2.1 Locality-Aware Virtual Process Topology 


In the DASH execution model, individual computation entities are called units. In 
the MPI-based implementation of the DASH runtime, a unit corresponds to an MPI 
rank but may occupy a locality domain containing several CPU cores or, in principle, 
multiple compute nodes. 
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Units are organized in hierarchical teams to match the logical structure of 
algorithms and machine components. Each unit is an immediate member of exactly 
one team at any time, initially in the predefined team ALL. Units in a team can be 
partitioned into child teams using the team’s split operation which also supports 
locality-aware parameters. 

On systems with asymmetric or deep memory hierarchies, it is highly desirable 
to split a team such that locality of units within every child team is optimized. A 
locality-aware split at node level could group units by affinity to the same NUMA 
domain, for example. For this, locality discovery has been added to the DASH 
runtime. Local hardware information from hwloc, PAPI, libnuma, and LIKWID 
of all nodes is collected into a global, uniform data structure that allows to query 
locality information by process ID or scope in the memory hierarchy. 

This query interface proved to be useful for static load balancing on heteroge- 
neous systems where team are split depending on the machine component capacities 
and capabilities. These are stored in a hierarchy of domains with two property maps: 


Capabilities invariant hardware locality properties that do not depend on the 
locality graph’s structure, like the number of threads per core, cache sizes, or 
SIMD width 

Capacities derivative properties that might become invalid when the graph 
structure is modified, like memory in a NUMA domain available per unit 


Figure 5 outlines the data structure and its concept of hardware abstraction in a 
simplified example of a topology-aware split. Domain capacities are accumulated 
from its subdomains and recalculated on restructuring. Team | and 2 both contain 
twelve cores but a different number of units. A specific unit’s maximum number of 
threads is determined by the number of cores assigned to the unit and the number of 
threads per core. 


GLOBAL accumulated invariant 
L:0 domain capacities domain capabilities 
eee units: { 0,1,23 F 
cores: 24 
SCOPE .--------------- NODE mem: 32 GB m.access: shared 


bytes/s: 


level .--------------- eal .@ 
domain tag .------------------------- h- 7 
units: 02,3 7 
cores: 12 
NUMA mem: 16 GB m.access: shared 


bytes/s: 


units: qf 3} 
cores: 6 m.access: shared 
—jmem: 8 GB bytes/s: 64.3G 
. . : SIMD: 512 
TE E32) O- ~~~] threads/c: 2 


: team 2 i 


Fig. 5 Domains in a locality hierarchy with domain attributes in dynamically accumulated 
capacities and invariant capabilities 
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2.2.2 Locality Domain Graph 


The machine component topology of the DASH runtime to support queries and 
topology-aware restructuring extends the tree-based hwloc topology model to 
represent properties and relations of machine components in a graph structure. It 
evolved to the Locality Domain Graph (LDG) concept which is available as the 
standalone library dyloc.? 

In formal terms, a locality domain graph models hardware topology as directed, 
multi-indexed multigraph. In this, nodes represent Locality Domains that refer 
to any physical or logical component of a distributed system with memory and 
computation capabilities, corresponding to places in X10 or Chapel’s locales [8]. 
Edges in the graph are directed and denote the following relationships, for example: 


e Containment indicating that the target domain is logically or physically contained 
in the source domain 

e Alias source and target domains are only logically separated and refer to the same 
physical domain; this is relevant when searching for a shortest path, for example 

e Leader the source domain is restricted to communication with the target domain 


2.2.3 Dynamic Hardware Locality 


Dynamic locality support requires means to specify transformations on the physical 
topology graph as views. Views realize a projection but must not actually modify the 
original graph data. Invariant properties are therefore stored separately and assigned 
to domains by reference only. 

Conceptually, multi-index graph algebra can express any operation on a locality 
domain graph, but complex to formulate. When a topology is projected to an acyclic 
hierarchy, transformations like partitioning, selection and grouping of domains can 
be expressed in conventional relational or set semantics. A partition or contraction 
of a topology graph can be projected to a tree data structure and converted to a hwloc 
topology object (Fig. 6). 

A locality domain topology is specific to a team and only contains domains that 
are populated by the team’s units. At initialization, the runtime initializes the default 
team ALL as root of the team hierarchy with all units and associates the team with 
the global locality graph containing all domains of the machine topology. When a 
team is split, its locality graph is partitioned among child teams such that a single 
partition is coherent and only contains domains with at least one leaf occupied by a 
unit in the child team. 

In a map-reduce scenario, dynamic views on machine topology to express for 
domain decomposition and process placement depending on machine component 


*https://github.com/dash-project/dyloc. 


DASH: Distributed Data Structures and Parallel Algorithms in a Global Address Space 115 


NUMA : 


DDR |- 


DDR |- 


Fig. 6 Illustration of a hardware locality domain graph as a model of node-level system architec- 
tures that cannot be correctly or unambiguously represented in a single tree structure. (a) Cluster 
in Intel Knights Landing, configured in Sub-NUMA clustering, Hybrid mode. Contains a quarter 
of the processor’s cores, MCDRAM local memory, affine to DDR NUMA domain. (b) Exemplary 
graph representation of Knights Landing topology in (a). Vertex categories model different aspects 
of component relationships, like cache-coherence and adjacency 
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Fig. 7 Illustration of the domain grouping algorithm to define a leader group for vertical 
communication. One core is selected as leader in domains 100 and 110 and separated into a group. 
To preserve the original topology structure, the group includes their parent domains and is added 
as a subdomain of their lowest common ancestor 


topology and improve portable efficiency. In the map phase, the algorithm is 
mostly concerned with horizontal locality in domain decomposition to distribute 
data according to the physical arrangement of cooperating processes. In the reduce 
phase, vertical locality of processes in the component topology determines efficient 
upwards communication of partial results. The locality domain graph can be used to 
project hardware topology to tree views for both cases. Figure 7 illustrates a locality- 
aware split of units in two modules such that one unit per module is selected for 
upwards communication. This principle is known as leader communication scheme. 
Partial results of units are then first reduced at the unit in the respective leader team. 
This drastically reduces communication overhead as the physical bus between the 
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Fig. 8 Using dyloc as intermediate process in locality discovery 


modules and their NUMA node is only shared by two leader processes instead of all 
processes in the modules (Fig. 8). 


2.2.4 Supporting Portable Efficiency 


As an example of both increased depth of the machine hierarchy and heterogeneous 
node-level architecture, the SuperMIC system? consists of 32 compute nodes with 
symmetric hardware configuration of two NUMA domains, each containing an Ivy 
Bridge (8 cores) host processor and a Xeon Phi “Knights Corner” coprocessors 
(Intel MIC 5110P) as illustrated in Fig. 9. 

For portable work load balancing on heterogeneous systems, domain decomposi- 
tion and virtual process topology must dynamically adapt the machine components’ 
inter-connectivity, capacities and capabilities. 


Capacities: Total memory capacity on MIC modules is 8GB for 60 cores, 
significantly less than 64 GB for 32 cores on host level 

Capabilities: MIC cores have a base clock frequency of 1.1GHz and 4 SMT 
threads, with 2.8 GHz and 2 SMT threads on host level 


To illustrate the benefit of dynamic locality, we briefly discuss the implementa- 
tion of the min_element algorithm in DASH. Its original variant is implemented 
as follows: domain decomposition divides the element range into contiguous blocks 
of identical size. All units then run a thread-parallel scan on their local block for a 
local minimum and enter a collective barrier once it has been found. Once all units 
finished their local work load, local results are reduced to the global minimum. 

Listing 3 contains the abbreviated implementation of the min element sce- 
nario utilizing runtime support based on a dynamic hardware locality graph. 


$https://www.Irz.de/services/compute/supermuc/supermic. 
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Fig. 9 SuperMIC node 
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Dynamic topology queries are utilized in three essential ways to improve overall 
load-balance: In domain decomposition (lines 3—6), to determine the number of 
threads available to the respective unit (line 19) and for a simple leader-based 
communication scheme (lines 8—10, 26). 

This implementation achieves portable efficiency across systems with different 
memory hierarchies and hardware component properties, and dynamically adapts to 
runtime-specific team size, range size, and available hardware components assigned 
to the team. Figure 10 shows timeline plots comparing time to completion and 
process idle time from a benchmark run executed on SuperMIC. 


1 // Dynamic topology-aware domain decomposition depending on 
2 // machine component properties and number of units in team: 


3 TeamLocality tloc (dash: :Team::All1()); 
4 LocBalancedPattern pattern(array_size, tloc); 
5 dash: :Array<T> array (pattern) ; 


6 


7 GlobIt min_element (GlobIt first, GlobIt last) { 


8 auto uloc = UnitLocality (myid()); 

9 auto leader = uloc.at_scope (scope: : MODULE) 
10 ieie ice O O 

11 automioc Emini ESE 


13 // Allocate shared variable for reduction result at leader: 
4 dash::Shared<GlobIt> glob min (leader); 
15 // Allocate shared array for local minimum values: 
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16 dash: :Array<GlobIt>(dash::Team::All1().size()) loc_mins; 


c 
= 


/ Dynamic query of locality runtime for number of threads: 
19 auto nthreads = uloc.num_threads () ; 


20 #pragma omp parallel for num_threads (nthreads) 
21 for Ge) / eae tind local result .-- 4/1) 
2 // Local write, no communication 

23 loc_mins[my_id] = loc min; 


24 dash: :barrier(); 


2% if (myid() == leader) { 

7 // leader reduces local results (instead of all-to-all 
28 // reduction) 

29 glob _min = std::min_ element (loc _mins.begin(), 

30 loc_mins.end())j; 

31 

32 } 

mB it 

34 } 


\ 


Listing 3 Code excerpt of the modified min_element algorithm 


Trace of process states in min_element on SuperMIC node 
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Fig. 10 Trace of process activities in the min _element algorithm exposing the effect of load 
balancing based on dynamic hardware locality 
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3 DASH C++ Data Structures and Algorithms 


The core of DASH is formed by data structures and algorithms implemented as 
C+-+ templates. These components are conceptually modeled after their equivalents 
in the C++ standard template library, shortening learning curves and increasing 
programmer productivity. A basic DASH data structure is the distributed array 
with configurable data distribution (dash: :Array) which closely follows the 
functionality of a STL vector except for the lack of runtime resizing. DASH 
also offers a multidimensional array and supports a rich variety of data distribution 
patterns [11]. A focus of the second half of the funding period was placed on smart 
data structures which are more specialized and support users in the development 
of certain types of applications. One such data structures for the development of 
stencil-based applications is described in Sect. 3.1. 

Similar to data structures, DASH also offers generalized parallel algorithms. 
Many of the over 100 generic algorithms contained in the STL have an equivalent 
in DASH (e.g., dash: : £111). One of the most useful but also challenging algo- 
rithms is sorting and Sect. 3.2 describes our implementation of scalable distributed 
sorting in the DASH library. 


3.1 Smart Data Structures: Halo 


Typical data structure used in ODE/PDE solvers or 2D/3D image analyzers 
are multi-dimensional arrays. The DASH NArray distributes data elements of a 
structured data grid and can be used similar to STL containers. But PDE solvers 
use stencil operations, not using the current data elements (center), but surrounding 
data elements (neighbors) as well. The use of the NArray it self is highly inefficient 
with stencil operations, because neighbors located in another sub-arrays may require 
remote access (via RDMA or otherwise). A more efficient approach is the use of so 
called “halo areas”. These areas contain copies of all required neighbor elements 
located on other compute nodes. The halo area width depends on the shape of the 
stencils and is determined by the largest distance from the center (per dimension). 
The stencil shape defines all participating data elements—center and neighbors. 
Figure | 1 shows two 9-point stencils with different shapes. The first stencil shape (a) 
accesses +2 data elements in both horizontal and vertical direction and the second 
one (b) accesses +1 stencil point in each direction. While the stencil shape Fig. 11a 
needs four halo areas with a width of two data elements. The other stencil shape 
requires eight halo areas with a width one data elements. Using halo areas ensures 
local data access for all stencil operations used on each sub-array. 

The Dash NArray Halo Wrapper wraps the local part of the NArray and 
automatically sets up a halo environment for stencil codes and halo accesses. 
Figure 12 shows an overview about all main components, which are explained in 
the following. 
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Fig. 11 Two shapes of a 9-point stencil. (a) +2 center stencil in horizontal and vertical directions. 
(b) Center +1 stencil point in each direction 
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Fig. 12 Architecture of the DASH Halo NArray Wrapper 
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3.1.1 Stencil Specification 


The discretization of the problem to be solved always determines the stencil shape 
to be applied on the structured grid. For a DASH based code this has to be specified 
as a stencil specification (StencilSpec), which is a collection of stencil points 
(StencilPoint). A StencilPoint consists of coordinates relative to the center and an 
optional weight (coefficient). The StencilSpec specified in Listing 4 describes the 
stencil shown in Fig. 1 1b. The center doesn’t have to be declared directly. 


1 using PointT = dash::halo::StencilPoint<25; 

2 dash: :halo::StencilSpec<PointT,6> stencil spec ( 

3 Pointl EE N 0), EO E 

4 PointT( 0,-1), , PointT ( 
Eoria al, il), iwosliaieur( al, Oy PONE 


Listing 4 Stencil specification for an 9-point stencil 


3.1.2 Region and Halo Specifications 


The region specification (RegionSpec) defines the location of all neighboring 
partitions. Every unit keeps 3” regions representing neighbor partitions for “left”, 
“middle”, and “right” in each of the n dimensions. All regions are identified by a 
region index and a corresponding region coordinate. Indexing are done with the Row 
Major linearization (last index grows fastest). Figure 13 shows all possible regions 
with its indexes and coordinates for a two dimensional scenario. Region 4 with the 
coordinates (1,1) is mapped to the center region and represents the local partition. 
Region 6 (2,0) points to a remote partition located in the south west. 


Fig. 13 Mapping of region 
coordinates and indexes 


0 1 2 
(0,0) (0,1) (0,2) 
NW N NE 


3 5 
(1,0) (1,2) 
w E 
6 7 8 
(2,0) (2,1) (2,2) 


122 K. Fiirlinger et al. 


The halo specification (HaloSpec) uses the RegionSpec to map neighbor par- 
titions which are the origins for halo copies to the local halo areas. From one or 
multiple StencilSpecs it infers which neighbor partitions are necessary. In case no 
StencilPoint has a negative offset from the center in horizontal direction, no halo 
regions for the ‘NW’, ‘W’, and ‘SW’ (Fig. 13) need to be created. If no StencilPoint 
has diagonal offsets (i.e. only one non-zero coordinate in the offsets) the diagonal 
regions ‘NW’,‘NE’, ‘SW’, and ‘SE’ can be omitted. 


3.1.3 Global Boundary Specification 


Additionally, the global boundary specification (GlobalBoundarySpec) allows to 
control the behavior at the outside of the global grid. For convenience, three different 
scenarios are supported. The default setting is NONE meaning that there are no 
halo areas in this direction. Therefore, the stencil operations are not applied in 
the respective boundary region where the stencil would require the halo to be 
present. As an alternative, the setting CYCLIC can be set. This will wrap around 
the simulation grid, so that logically the minimum coordinate becomes a neighbor 
to the maximum coordinate. Furthermore, the setting CUSTOM creates a halo area 
but never performs automatic update of its elements from any neighbors. Instead, 
this special halo area can be written by the simulation (initially only or updated 
regularly). This offers a convenient way to provide boundary conditions to a PDE 
solver. The GlobalBoundarySpec can be defined separately per dimension. 


3.1.4 Halo Wrapper 


Finally, using the aforementioned specifications as inputs, the halo wrapper 
(HaloWrapper) creates HaloBlocks for all local sub-arrays. The halo area extension 
is derived from all declared StencilSpecs by determining the maximum offset of 
any StencilPoint in the given direction. 

The mapping between the local HaloBlocks and the halo regions pointing to 
the remote neighbor data elements is subjected to the HaloWrapper, as well as 
the orchestration of efficient data transfers for halo data element updates. The 
data transfer has to be done block-wise instead of element-wise to gain decent 
performance. While the HaloBlock can access contiguous memory, the part of the 
neighbor partition marked as halo area, usually can’t be accessed contiguously— 
compare Fig. 14. Therefore, the HaloWrapper relies on DART’s support for efficient 
strided data transfers. 

The halo data exchange can be done per region or for all regions at once. It can be 
called asynchronously and operates independent between all processes and doesn’t 
use process synchronization. The required subsequent wait operation waits for local 
completion only. 
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Fig. 14 Halo data exchange of remote strided data to contiguous memory: (a) in 2D a corner halo 
region has a fixed stride, whereas (b) in 3D the corner halo region has two different strides 


3.1.5 Stencil Operator and Stencil Iterator 


So far, the HaloWrapper was used to create HaloBlocks to fit all given StencilSpecs. 
Besides that, the HaloWrapper also provides specific views and operations for each 
StencilSpec. 

First, for every StencilSpec the HaloWrapper provides a StencilOperator with 
adapted inner and boundary views. The inner view contains all data elements 
that don’t need a halo area when using the given stencil operation. All other data 
elements are marked via the boundary view. These two kind of views are necessary 
to overlap the halo data transfer with the inner computation. The individual view per 
StencilSpec allows to make the inner view as large as possible, regardless of other 
StencilSpecs. 

Second, the HaloWrapper offers StencilSpec specific Stencillterators. They 
iterate over all elements assigned by a given view (inner) or a set of views 
(boundary). With these iterators center elements can be accessed—equivalent to 
STL iterators—via the dereference operator. Neighboring data elements can be 
accessed with a provided method. Stencil points pointing to elements within a 
halo area, are resolved automatically without conditionals in the surrounding code. 
Stencillterators can be used with random access, but are optimized for the increment 
operator. 
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3.1.6 Performance Comparison 


A code abstraction hiding complexity is useful only, if no or minor performance 
impact is added. Therefore, a plain MPI implementation of a heat equation was 
compared to a DASH based one regarding weak and strong scaling behavior. All 
measurements were performed on the Bull HPC-Cluster “Taurus” at ZIH, TU 
Dresden. Each compute node has two Haswell E5-2680 v3 CPUs at 2.50 GHz with 
12 physical cores each and 64GB memory. Both implementations were built with 
gcc 7.1.0 and OpenMPI 3.0.0. 

The weak scaling scenario increases the number of grid elements proportional 
to the number of compute nodes. The accumulated main memory is almost entirely 
used up by each compute grid. Figure 15 shows that both implementations almost 
have identical and perfect weak scaling behavior. Note that the accumulated waiting 
times differ significantly. This is due to two effects. One is contiguous halo areas 
(north and south) vs. strided halo areas (east and west). The other is intra node vs. 
inter node communication. 

The strong scaling scenario uses 55,0007 grid elements to fit into the main 
memory of a single compute node. It is solved with 1 to 768 CPU cores (== MPI 
ranks), where 24 cores equals to one full compute node and 768 cores to 32 compute 
nodes. Figure 16 shows again an almost identical performance behavior between 
DASH and MPI for the total runtime. Notably, both show the same performance 
artifact around 16 to 24 cores. This can be ascribed to an increased number of 
last level cache load misses which indicates that both implementations are memory 
bound at this number of cores per node. 


1024 | | i I i i " TMPI total E j} 
g [ó] 5 [ó] lal Q gm DASH total a | 
MPI boundary k 
L DASH boundary 
ile MPI wait o 
DASH wait . 
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Fig. 15 Weak scaling in DASH vs. MPI 
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Fig. 16 Strong scaling for 55,000 x 55,000 elements in DASH vs. MPI. Effective wait time for 
asynchronous halo exchanges is shown in addition 


3.2 Parallel Algorithms: Sort 


Sorting is one of the most important and well studied non-numerical algorithms 
in computer science and serves as a basic building block in a wide spectrum of 
applications. A notable example in the scientific domain are N-Body particle simu- 
lations which are inherently communication bound due to load imbalance. Common 
strategies to mitigate this problem include redistributing particles according to a 
space filling curve (e.g., Morton Order) which can be achieved with sorting. Other 
interesting use cases which can be addressed using DASH are Big Data applications, 
e.g., Google PageRank. 

Key to achieve performance is obviously to minimize communication. This 
applies not only to distributed memory machines but to shared memory architectures 
as well. Current supercomputers facilitate nodes with large memory hierarchies 
organized in a growing number of NUMA domains. Depending on the data 
distribution, sorting is subject to a high fraction of data movement and the more we 
communicate across NUMA boundaries the more negative the result performance 
impact becomes. 

In the remainder of this section we briefly describe the problem of sorting in 
a more formal manner and summarize the basic approaches in related work. It 
follows a more detailed elaboration of our sorting algorithm [20]. Case studies 
on both distributed and shared memory demonstrate our performance efficiency. 
Results reveal that we can outperform state of the art implementations with our 
PGAS algorithm. 
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3.2.1 Preliminaries 


Let X be a set of N keys evenly partitioned among P processors, thus, each 
processor contributes n; ~ N/P keys. We further assume there are no duplicate 
keys which can technically be achieved in a straightforward manner. Sorting 
permutes all keys by a predicate which is a binary relation in set X. Recursively 
applying this predicate to any ordered pair (x, y) drawn from X enables to determine 
the rank of an element Z (x) = k with x as the k-th order statistic in X. Assuming 
our predicate is less than (i.e., <) the output invariant after sorting guarantees that 
for any two subsequent elements x, y € X 


x< ye I(x)< I(y). 


Scientific applications usually require a balanced load to maximize performance. 
Given a load balance threshold €, local balancing means that in the sorted sequence 
each processor P; owns at most N(1 + €)/P keys. This does not always result in a 
globally balanced load which is an even stronger guarantee. 


Definition 1 For all i € {1..P} we have to determine splitter S; to partition the 
input sequence into P subsequences such that 


Ni _ We EE E 
P ap Pp 2P 

Determining these splitters boils down to the k-way selection problem which is a 

core algorithm in this work. If € = 0 we need to perfectly partition the input which 

increases communication complexity. However, it often is the easiest solution in 

terms of programming productivity which is a major goal of the DASH library. 


3.2.2 Related Work 


Sorting large inputs can be achieved through parallel sample sort which is a 
generalization of Quicksort with multiple pivots [5]. Each processor partitions local 
elements into p pieces which are obtained out of a sufficiently large sample of 
the input. Then, all processors exchange elements among each other such that 
piece i is copied to processor i. In a final step, all processors sort received pieces 
locally, resulting in a globally sorted sequence. Perfect partitioning can be difficult 
to achieve as splitter selection is based only on a sample of the input. 

In parallel p-way mergesort each processor first sorts the local data portion and 
subsequently partitions it, similar to sample sort, into p pieces. Using an ALL- 
TO-ALL exchange all pieces are copied to the destination processors which finally 
merge them to obtain a globally sorted sequence. Although this algorithm has worse 
isoefficiency due to the partitioning overhead compared to sample sort, perfect 
partitioning becomes feasible since data is locally sorted. 
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Scalable sorting algorithms are compromises between these two extremes and 
apply various strategies to mitigate negative performance impacts of splitter selec- 
tion (partitioning) and ALL-TO-ALL communication [2, 15, 16, 31]. Instead of 
communicating data only once, partitioning is done recursively from a coarse- 
grained to a more fine-grained solution. Each recursion leads to independent 
subpartitions until the solution is found. Ideally, the level of recursion maps to the 
underlying hardware resources and network topology. 

This work presents two contributions. First, we generalize a distributed selection 
algorithm to achieve scalable partitioning [28]. Second, we address the problem 
of communication-computation overlap in the ALL-TO-ALL exchange, which is 
conceptually limited in MPI as the underlying communication substrate. 


3.2.3 Histogram Sort 


The presented sorting algorithm consists of four supersteps as delineated in Fig. 17. 


Local Sort Sorts the local portion using a fast shared memory algorithm. 

Splitting Each processor partitions the local array into p pieces. We generalize 
distributed selection to a p-way multiselect. 

Data Exchange Each processor exchanges piece i with processor i according to 
the splitter boundaries. 

Local Merge Each processor merges the received sorted pieces. 


Splitting is based on distributed selection [28]. Instead of finding one pivot we 
collect multiple pivots (splitters) in a single iteration, one for each active range. If 
a pivot matches a specific rank we do not consider this range anymore and discard 
it from the set of active ranges. Otherwise, we examine each of the two resulting 


Fast shared memory sort. 


Parallel k-way select with O(log? P(N/P)). 


n i i to i : i i 
All-to-All Exchange. 


Local Merge. 


t~~ Execution 


Fig. 17 Algorithmic schema of dash: : sort with four processors (P = 4) 
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subranges whether they need to be considered in future iterations and add them to 
the set of active ranges. The vector of splitters follows from Definition 1 (page 126) 
and is the result of a prefix sum over the local capacities of all processors. 

Another difference compared to the original algorithm [28] is that in our case 
we replace the local partition with binary search which is possible due to our 
initial sorting step in all processors. Thus, to determine a local histogram over p 
pieces requires logarithmic computation complexity instead of linear complexity. 
A global histogram to determine if all splitters are valid (or if we need to refine 
the boundaries) requires a single REDUCE over all processors with logarithmic 
communication complexity. 

The question how many iterations we need is answered as follows. We know 
that for the base case with only two processors (i.e., only one splitter) distributed 
selection has a recursive depth of O(log p). This follows from the weighted median 
for the pivot selection which guarantees a reduction of the working set by at least 
one quarter each iteration. As described earlier instead of a single pivot we collect 
multiple pivots in a single iteration which we achieve by a list of active ranges. 
Although the local computation complexity increases by a factor of O(log p) the 
recursion depth does not change. Hence, the overall communication complexity is 
O(log? p) including the REDUCE call each iteration. 

After successfully determining the splitters all processors communicate the 
locally partitioned pieces with an ALL-TO-ALL exchange. Merging all received 
pieces leads to a globally sorted sequence over all processors. Due to the high 
communication volume communication-computation overlap is required to achieve 
good scaling efficiency. However, for collective operations MPI provides a very 
limited interface. While we can use a non-blocking ALL-TO-ALL we cannot 
operate on partially received pieces. For this reason we designed our own ALL- 
TO-ALL algorithm to pipeline communication and merging. Similar to the Butterfly 
algorithm processor i sends to destination (i + r) (mod p) and receives from 
(i — r) (mod p) in round r [33]. However we schedule communication requests 
only as long as a communication buffer of a fixed length is not completely allocated. 
As soon as some communication requests complete we schedule new requests 
while merging the received chunks. PGAS provides additional optimizations. For 
communication within a shared memory node we use a cache-efficient ALL-TO- 
ALL algorithm to minimize negative cache effects among the involved processors. 
Instead of scheduling send receive pairs processor ranks are reordered according 
to a Morton order. Data transfer is performed using one-sided communication 
mechanisms. Similar optimizations can be applied to processor pairs running on 
nearby nodes. We are preparing a paper to describe the involved optimizations in 
more detail. First experimental evaluations reveal that we can achieve up to 22% 
speedup compared to a single ALL-TO-ALL followed by a p-way local merge. 

In the next section we demonstrate our performance scalability against a state- 
of-the-art implementation in Charm++. 
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Fig. 18 Strong scaling study with Charm+-+ and DASH. (a) Median execution time. (b) Strong 
scaling behavior of dash: : sort 


3.2.4 Evaluation and Conclusion 


We conducted the experiments on SuperMUC Phase 2 hosted at the Leibnitz Super- 
computing Center. This system is an island-based computing cluster, each equipped 
with 512 nodes. Each node has two Intel Xeon E5-2697v3 14-core processors with 
a nominal frequency of 2.6 GHZ and 64GB of memory, although only 56GB are 
usable due to the operating system. Computation nodes are interconnected in a non- 
blocking fat tree with Infiniband FDR14 which achieves a peak bisection bandwidth 
of 5.1 TB/s. We compiled our binaries with Intel ICC 18.0.2 and linked the Intel MPI 
2018.2 library for communication. The Charm+-+ implementation was executed 
using the most recent stable release.* On each node we scheduled only 16 MPI ranks 
(28 cores available) because the Charm+-+ implementation requires the number of 
ranks to be a power of two. We emphasize that our implementation in DASH does 
not rely on such constraints. 

The strong scaling performance results are depicted in Fig. 18a. We sort 28 
GBytes of uniformly distributed 64-bit signed integers. This is the maximum 
memory capacity on a single node because our algorithm is not in-place. We always 
report the median time out of 10 executions along with the 95% confidence interval, 
excluding an initial warmup run. For Charm+-+ we can see wider confidence 
intervals. We attribute this to a volatile histogramming phase which we can see after 
analyzing generated log files in the Charm+-+ experiments. Overall, we observe that 
both implementations achieve nearly linear speedup with a low number of cores. 
Starting from 32—64 nodes scalability gets worse. DASH still achieves a scaling 
efficiency of ~0.6 on 3500 cores while Charm+-+ is slightly below. Figure 18b 
visualizes the relative fraction of the most relevant algorithm phases in a single run. 
It clearly identifies histogramming as the bottleneck if we scale up the number of 
processors. This is not surprising because with 128 nodes (2048 ranks) each rank 
operates on only 8 MB of memory. 


4¥6.9.0, http://charmplusplus.org/download/. 
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Fig. 19 Weak scaling study with Charm+-+ and DASH. (a) Weak scaling efficiency. (b) Weak 
scaling behavior of dash: : sort 


Figure 19a depicts the weak scaling efficiency. The absolute median execution 
time for DASH started from 2.3s on one node and ended with 4.6s if we 
scale to 128 nodes (3584 cores). As expected, the largest fraction of time is 
consumed in local sorting and the ALL-TO-ALL data exchange because we have to 
communicate 256 GB across the network. Figure 19b confirms this. The collective 
ALLREDUCE of P — | splitters among all processors in histogramming overhead is 
almost amortized from the data exchange which gives an overall good scalability for 
DASH. The Charm+-+ histogramming algorithm again shows high volatility with 
running times from 5—25 s, resulting in drastic performance degradation. 

Our implementation shows good scalability on parallel machines with a large 
processor count. Compared to other algorithms we do not pose any assumptions on 
the number of ranks, the globally allocated memory volume or the key distribution. 
Performance measurements reveal that our general purpose approach does not 
result in performance degradation compare to other state-of-the-art algorithms. Our 
optimized MPI ALL-TO-ALL exchange with advanced PGAS techniques shows 
how we can significantly improve communication-computation overlap. Finally, the 
STL compliant interface enables programmers to easily integrate a scalable sorting 
algorithm into scientific implementations. 


4 Use Cases and Applications 


4.1 A Productivity Study: The Cowichan Benchmarks 


In this section we present an evaluation of DASH focusing on productivity and 
performance by comparison with four established parallel programming approaches 
(Go, Chapel, Cilk, TBB) using the Cowichan set of benchmark kernels. 
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4.1.1 The Cowichan Problems 


The Cowichan problems [36], named after a tribal area in the Canadian Northwest, 
is a set of small benchmark kernels that have been developed primarily for the 
purpose of assessing the usability of parallel programming systems. There are two 
versions of the Cowichan problems and here we restrict ourselves to a subset of the 
problems found in the second set. The comparison presented in this section is based 
on previous work by Nanz et al. [23] as we use their publicly available code> to 
compare with our own implementation of the Cowichan benchmarks using DASH. 
The code developed as part of a study by Nanz et al. has been created by expert 
programmers in Go, Chapel, Cilk and TBB and can thus be regarded as idiomatic 
for each approach and free of obvious performance defects. 
The five (plus one) problems we consider in our comparison are the following: 


randmat: Generate a(nrows x ncols) matrix mat of random integers in the 
range 0,...,max — 1 using a deterministic pseudo-random number generator 
(PRNG). 

thresh: Given an integer matrix mat, and a thresholding percentage p, compute 
a boolean matrix mask of similar size, such that mask selects p percent of the 
largest values of mat. 

winnow: Given an integer matrix mat, a boolean matrix mask, and a desired 
number of target elements nelem, perform a weighted point selection operation 
using sorting and selection. 

outer: Given a vector of nelem (row, col) points, compute an (nelem x 
nelem) outer product matrix omat and a vector vec of floating point values 
based on the Euclidean distance between the points and the origin, respectively. 

matvec: Given annelem x nelem matrix mat and a vector vec, compute the 
matrix-vector product (row-by-row inner product) res. 

chain: Combine the kernels in a sequence such that the output of one becomes 
the input for the next. I.e., chain = randmat o thresh o winnow o outer o matvec. 


4.1.2 The Parallel Programming Approaches Compared 


We compare our implementation of the Cowichan problems with existing solutions 
in the following four programming approaches. 

Chapel [8] is an object-oriented partitioned global address space (PGAS) 
programming language developed since the early 2000s by Cray, originally as part 
of DARPA’s High Productivity Computing Systems (HPCS) program. We have used 
Chapel version 1.15.0 in our experiments. 

Go [10] is a general-purpose systems-level programming language developed 
at Google in the late 2000s that focuses on concurrency as a first-class concern. 
Go supports lightweight threads called goroutines which are invoked by prefixing 


Shttps://bitbucket.org/nanzs/multicore-languages/src. 
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a function call with the go keyword. Channels provide the idiomatic way for 
communication between goroutines but since all goroutines share a single address 
space, pointers can also be used for data sharing. We have used Go version 1.8 in 
our experiments. 

Cilk [6] started as an academic project at MIT in the 1990s. Since the late 2000s 
the technology has been extended and integrated as Cilk Plus into the commercial 
compiler offerings from Intel and more recently open source implementations for 
the GNU Compiler Collection (GCC) and LLVM became available. Cilk’s initial 
focus was on lightweight tasks invoked using the spawn keyword and dynamic 
workstealing. Later a parallel loop construct (cilk_ for) was added. We have used 
Cilk as integrated in Intel C/C++ compilers version 18.0.2. 

Intel Threading Building Blocks (TBB) [25] is a C++ template library for 
parallel programming that provides tasks, parallel algorithms and containers using 
a work-stealing approach that was inspired by the early work on Cilk. We have used 
TBB version 2018.0 in our experiments, which is part of Intel Parallel Studio XE 
2018. 


4.1.3 Implementation Challenges and DASH Features Used 


In this section we briefly describe the challenges encountered when implementing 
the Cowichan problems, a more detailed discussion can be found in a recent 
publication [13]. Naturally this small set of benchmarks only exercises a limited 
set of the features offered by either programming approach. However, we believe 
that the requirements embedded in the Cowichan problems are relevant to a wide 
set of other uses cases, including the classic HPC application areas. 


Memory Allocation and Data Structure Instantiation The Cowichan problems 
use one- and two-dimensional arrays as the main data structures. 1D arrays are 
widely supported by all programming systems. True multidimensional arrays, how- 
ever, are not universally available and as a consequence workarounds are commonly 
used. The Cilk and TBB implementation both adopt a linearized representation of 
the 2D matrix and use a single malloc call to allocate the whole matrix. Element- 
wise access is performed by explicitly computing the offset of the element in the 
linearized representation by mat[i x ncols + j]. Go uses a similar approach but 
bundles the dimensions together with the allocated memory in a custom type. In 
contrast, Chapel and DASH support a concise and elegant syntax for the allocation 
and direct element-wise access of their built-in multidimensional arrays. In the case 
of DASH, the distributed multidimensional array is realized as a C++ template 
class that follows the container concept of the standard template library (STL) [11]. 


Work Sharing In all benchmarks, work has to be distributed between multiple 
processes or threads, for example when computing the random values in randmat in 
parallel. randmat requires that the result be independent of the degree of parallelism 
used and all implementations solve this issue by using a separate deterministic seed 
value for each row of the matrix. A whole matrix row is the unit of work that is 
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distributed among the processes or threads. The same strategy is also used for outer 
and product. 

Cilk uses cilk_for to automatically partition the matrix rows and TBB uses 
C++ template mechanisms to achieve a similar goal. Go does not offer built- 
in constructs for simple work sharing and the functionality has to be laboriously 
created manually using goroutines, channels, and ranges. 

In Chapel this type of work distribution can simply be expressed as a parallel 
loop (forall). 

In DASH, the work distribution follows the data distribution. I.e., each unit is 
responsible for computing on the data that is locally resident, the owner computes 
model. Each unit determines its locally stored portion of the matrix (guaranteed to 
be a set of rows by the data distribution pattern used) and works on it independently. 


Global Max Reduction In thresh, the largest matrix entry has to be determined to 
initialize other data structures to their correct size. The Go reference implementation 
doesn’t actually perform this step and instead just uses a default size of 100, Go is 
thus not discussed further in this section. 

In Cilk a reducer_max object together with a parallel loop over the rows is 
employed to find the maximum. Local maximal values are computed in parallel 
and then the global maximum is found using the reducer object. In TBB a similar 
construct is used (tbb: :parallel_ reduce). In these approaches finding the 
local maximum and computing the global maximum are separate steps that require 
a considerable amount of code (several 10s of lines of code). 

Chapel again has the most concise syntax of all approaches, the maximum value 
is found simply by nmax = max reduce matrix. The code in the DASH 
solution is nearly as compact, by using the max_element () algorithm to find 
the maximum. Instead of specifying the matrix object directly, in DASH we have 
to couple the algorithm and the container using the iterator interface by passing 
mat .begin() and mat .end () to denote the range of elements to be processed. 


Parallel Histogramming thresh requires the computation of a global cumulative 
histogram over an integer matrix. Thus, for each integer value 0, ..., nmax — 1 we 
need to determine the number of occurrences in the given matrix in parallel. The 
strategy used by all implementations is to compute one or multiple histograms by 
each thread in parallel and to later combine them into a single global histogram. 

In DASH we use a distributed array to compute the histogram. First, each 
unit computes the histogram for the locally stored data, by simply iterating over 
all local matrix elements and updating the local histogram (histo.local). 
Then dash: : transform is used to combine the local histograms into a sin- 
gle global histogram located at unit 0. dash: :transform is modeled after 
std::transform, a mutating sequence algorithm. Like the STL variant, the 
algorithm works with two input ranges that are combined using the specified 
operation into an output range. 


Parallel Sorting winnow requires the sorting of 3-tuples using a custom compar- 
ison operator. Cilk and TBB use a parallel shared memory sort. Go and Chapel 
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call their respective internal sort implementations (quicksort in the case of Chapel) 
which appears to be unparallelized. DASH can take advantage of a parallel and 
distributed sort implementation based on histogram sort cf. 3.2. 


4.1.4 Evaluation 


We first compare the productivity of DASH compared to the other parallel program- 
ming approaches before analyzing the performance differences. 


Productivity We evaluate programmer productivity by analyzing source code 
complexity. Table 1 shows the lines of code (LOC) used in the implementation 
for each kernel, counting only lines that are not empty or comments. Of course, 
LOC is a crude approximation for source code complexity but few other metrics 
are universally accepted or available for different programming languages. LOC 
at least gives a rough idea for source code size, and, as a proxy, development 
time, likelihood for programming errors and productivity. The overall winners in 
the productivity category are Chapel and Cilk, which achieve the smallest source 
code size for three benchmark kernels. For most kernels, DASH also achieves a 
remarkably small source code size considering that the same source code can run 
on shared memory as well as on distributed memory machines. 


Performance As the hardware platform for our experiments we have used one 
or more nodes of SuperMUC Phase 2 (SuperMUC-HW) with Intel Xeon E5-2697 
(Haswell) CPUs with 2.6 GHz, 28 cores and 64 GB of main memory per node. 


Single Node Comparison We first investigate the performance differences between 
DASH and the four established parallel programming models on a single Haswell 
node. We select a problem size of nrows = ncols = 30,000 because this is the largest 
size that successfully ran with all programming approaches. 


Table 2 shows the absolute performance (runtime in seconds) and relative 
runtime (compared to DASH) when using all 28 cores of a single node. Evidently, 
DASH is the fastest implementation, followed by Cilk and TBB. Chapel and 
especially Go can not deliver competitive performance in this setting. 

Analyzing the scaling behavior in more detail (not shown graphically due to 
space restrictions) by increasing the number of cores used on the system from 1 
to 28 reveals a few interesting trends. For outer and randmat, DASH, Cilk and TBB 


Table 1 Lines-of-code DASH |Go | Chapel | TBB | Cilk 
(LOC) measure for each = 

kernel and programming Randmat | 18 29 | 14 15 12 
approach, counting Thresh 31 63 | 30 56 52 
non-empty and non-comment Winnow | 67 94 |31 74 78 
lines only Outer |23 38 | 15 19 |15 


Product 19 27 |11 14 10 
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Table 2 Performance comparison for each kernel and programming approach using all cores on 
one node of SuperMUC-HW 


Absolute runtime (sec.) Relative runtime 

DASH | Go Chapel | TBB | Cilk | DASH | Go Chapel | TBB | Cilk 
Randmat | 0.12 5.84 0.36 | 0.15 | 0.19 | 1.00 48.69 | 3.03 1.26 | 1.58 
Thresh 0.20 0.64 0.53 | 0.41 | 0.40 | 1.00 3.19 | 2.64 | 2.06 | 2.00 
Winnow | 2.99 366.13 | 256.40 | 9.45 | 4.28 | 1.00 122.45 |85.75  |3.16 | 1.43 
Outer 0.25 1.18 0.39 | 0.27 |0.31 | 1.00 4.70 | 1.56 | 1.06 | 1.24 
Product | 0.06 0.46 0.19 | 0.12 |0.13 | 1.00 7.66 | 3.15 |2.01 | 2.16 


behave nearly identical in terms of absolute performance and scaling behavior. The 
small performance advantage of DASH can be attributed to better NUMA locality of 
DASH, where all work is done on process-local data. For thresh and winnow DASH 
can take advantage of optimized parallel algorithms (for global max reduction and 
sorting, respectively) whereas these operations are performed sequentially in some 
of the other approaches. For product the DASH implementation takes advantage of 
a local copy optimization to improve data locality. The scaling study reveals that 
this optimization at first costs performance but pays off at larger core counts. 


Multinode Scaling We next investigate the scaling of the DASH implementation on 
up to 16 nodes (448 total cores) of SuperMUC-HW. None of the other approaches 
can be compared with DASH in this scenario. Cilk and TBB are naturally restricted 
to shared memory systems by their threading-based nature. Go realizes the CSP 
(communicating sequential processes) model that would, in principle, allow for a 
distributed memory implementation but since data sharing via pointers is allowed, 
Go is also restricted to a single shared memory node. Finally, Chapel targets both 
shared and distributed memory systems, but the implementation of the Cowichan 
problems available in this study is not prepared to be used with multiple locales 
and cannot make use of multiple nodes (it lacks the dmapped specification for data 
distribution). 

The scaling results are shown in Fig.20 for two data set sizes. In Fig. 20 
(left) we show the speedup relative to one node for a small problem size 
(nrows = ncols = 30,000) and in Fig. 20 (right) we show the speedup of a larger 
problem size (nrows = ncols = 80,000) relative to two nodes, since this problem is 
too big to fit into the memory of a single node. 

Evidently for the smaller problem size, the benchmark implementations reach 
their scaling limit at about 10 nodes, whereas the larger problem sizes manage to 
scale well even to 16 nodes, with the exception of the product benchmark which 
shows the worst scaling behavior. This behavior can be explained by the relatively 
large communication requirement of the product benchmark kernel. 
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Fig. 20 Scaling behavior of the Cowichan benchmarks with up to 16 nodes on SuperMUC-HW. 
(a) Multinode Scaling, 30 x 30k Matrix. (b) Multinode Scaling: 80k x 80k Matrix 


4.1.5 Summary 


In this section we have evaluated DASH, a new realization of the PGAS approach 
in the form of a C++ template library by comparing our implementation of the 
Cowichan problems with those developed by expert programmers in Cilk, TBB, 
Chapel, and Go. We were able to show that DASH achieves both remarkable 
performance and productivity that is comparable with established shared memory 
programming approaches. DASH is also the only approach in our study where the 
same source code can be used both on shared memory systems and on scalable 
distributed memory systems. This step, from shared memory to distributed memory 
systems is often the most difficult for parallel programmers because it frequently 
goes hand in hand with a re-structuring of the entire data distribution layout of the 
application. With DASH the same application can seamlessly scale from a single 
shared memory node to multiple interconnecting nodes. 


4.2 Task-Based Application Study: LULESH 


The Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics 
(LULESH) is part of the Department of Energy’s Coral proxy application 
benchmark suite [19]. The domain typically scales with the number of processes and 
is divided into a grid of nodes, elements, and regions of elements. The distribution 
and data exchange follows a 27-point stencil with three communication steps per 
timestep. 

The reference implementation of LULESH uses MPI send/recv communication 
to facilitate the boundary exchange between neighboring processes and OpenMP 
worksharing constructs are used for shared memory parallelization. Instead, we 
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Fig. 21 Lulesh reference implementation and Lulesh using DASH tasks running on a Cray XC40 


iteratively ported LULESH to use DASH distributed data structures for neighbor 
communication and DASH tasks for shared memory parallelization, with the ability 
to partially overlap communication and computation. 

The first step was to adapt the communication to use DASH distributed data 
structures instead of MPI [12]. In order to gradually introduce tasks, we started 
by porting the OpenMP worksharing loop constructs with to use the DASH task- 
loop construct discussed in Sect. 2.1.3. In a second step, the iteration chunks of the 
taskloops were connected through dependencies, allowing a breadth-first scheduler 
to execute tasks from different task-loop statements concurrently. In a last step, a 
set of high-level tasks has been introduced that encapsulate the task-loop statements 
and coordinate the computation and communication tasks. 

The resulting performance at scale on a Cray XC40 is shown in Fig. 21. For 
larger problem sizes (s = 300? elements per node), the speedup at scale of the 
DASH port over the reference implementation is about 25%. For smaller problem 
sizes (s = 200°), the speedup is significantly smaller at about 5%. We believe that 
further optimizations in the tasking scheduler may yield improvements even for 
smaller problem sizes. 


5 Outlook and Conclusion 


We have presented an overview of our parallel programming approach DASH, 
focusing on recent activities in the areas of support for task-based execution, 
dynamic locality, parallel algorithms, and smart data structures. Our results show 
that DASH offers a productive programming environment that also allows program- 
mers to write highly efficient and scalable programs with performance on-par or 
exceeding solutions relying on established programming systems. 
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Work on DASH is not finished. The hardware landscape in high performance 
computing is getting still more complex, while the application areas are getting 
more diverse. Heterogeneous compute resources are common, nonvolatile memories 
are making their appearance and domain specific architectures are on the horizon. 
These and other challenges must be addressed by DASH to be a viable parallel 
programming approach for many users. 

The challenge of utilizing heterogeneous computing resources, primarily in the 
form of graphics processing units (GPUs) used in high performance computing 
systems, is addressed in a project building on DASH funded by the German Federal 
Ministry of Education and Research (BMBF) called MEPHISTO. We close our 
report on DASH with a short discussion of MEPHISTO. 


5.1 MEPHISTO 


The PGAS model simplifies the implementation of programs for distributed memory 
plattforms, but data locality plays an important role for performance critical 
applications. The owner-computes paradigm aims to maximize the performance by 
minimizing the intra-node data movement. This focus on locality also encourages 
the usage of shared memory parallelism. During the DASH project the partners 
already experimented with shared memory parallelism and how it can be integrated 
into the DASH ecosystem. OpenMP was integrated into suitable algorithms, 
experiments with Intel’s Thread Building Blocks as well as conventional POSIX- 
thread-based parallelism were conducted. These, however, had to be fixed for a 
given algorithm: a user had no control over the acceleration and the library authors 
had to find sensible configurations (e.g. level of parallelism, striding, scheduling). 
Giving users of the DASH library more possibilities and flexibility is a focus of the 
MEPHISTO project. 

The MEPHISTO project partly builds on the work that has been done in DASH. 
One of its goal is to integrate abstractions for better data locality and heterogeneous 
programming with DASH. Within the scope of MEPHISTO two further projects are 
being integrated with DASH: 


e Abstraction Library for Parallel Kernel Acceleration (ALPAKA) [22] 
* Low Level Abstraction of Memory Access? (LLAMA) 


ALPAKA is a library that adds abstractions for node-level parallelism for C++ 
programs. Once a kernel is written with ALPAKA, it can be easily ported to 
different accelerators and setups. For example a kernel written with ALPAKA can 
be executed in a multi-threaded CPU environment as well as on a GPU, for example 
using the CUDA backend. ALPAKA provides optimized code for each accelerator 
that can be further customized by developers. To switch from one back end to 


Shttps://github.com/mephisto-hpc/llama. 
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another requires just the change of one type definition in the code so that the code 
is portable. 

Integrating ALPAKA and DASH brings flexibility for node-level parallelism 
within a PGAS environment: switching an algorithm from a multi threaded CPU 
implementation to an accelerator now only requires passing one more param- 
eter to the function call. It also gives the developer an interface to control 
how and where the computation should happen. The kernels of algorithms like 
dash: :transform_reduce are currently extended to work with external 
executors like ALPAKA. Listing 5 shows how executors are used to offload 
computation using ALPAKA. Note that the interface allows offloading to other 
implementations as well. In the future DASH will support any accelerator that 
implements a simple standard interface for node-level parallelism. 


| policy.executor() .bulk_twoway execute ( 
[=i (siza © block_index, 
Bae E element_index, 
Tx res, 
value_type block first) { 
res [block_index] = binary_op ( 
res [block_index], 
unary_op (block first [element_index] ) 
=> Ia 


9 }, 

10 in first, // a "shape" 

11 [&] () -> std::vector<std::future<T>>& { 
2 return results; 

13 Y 


std::ignore); // shared state (unused) 


Listing 5 Offloading using an executor inside dash: :transform_reduce. The executor can 
be provided by MEPHISTO 


LLAMA on the other hand focuses solely on the format of data in memory. 
DASH already provides a flexible Pattern Concept to define the data placement for 
a distributed container. However, LLAMA gives developers finer grained control 
over the data layout. DASH’s patterns map elements of a container to locations in 
memory, but the layout of the elements itself is fixed. With LLAMA, developers 
can specify the data layout with a C++ Domain Specific Language (DSL) to 
fit the application’s needs. A typical example is a conversion from Structure of 
Arrays (SoA) to Array of Structures (AoS) and vice versa. But also more complex 
transformations like projections are being evaluated. 

Additionally to the integration of LLAMA, a more flexible, hierarchical and 
context-sensitive pattern concept is being evaluated. Since the current patterns map 
elements to memory locations in terms of units (i.e., MPI processes), using other 
sources for parallelism can be a complex task. Mapping elements to (possibly 
multiple) accelerators was not easily possible. Local patterns extend the existing 
patterns. By matching elements with entities (e.g. a GPU), the node-local data may 
be assigned to other compute units beside processes. 
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Both projects are currently evaluated in the context of DASH to explore how the 
PGAS programming model can be used more flexibly and efficiently. 
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Abstract The ESSEX project has investigated programming concepts, data struc- 
tures, and numerical algorithms for scalable, efficient, and robust sparse eigenvalue 
solvers on future heterogeneous exascale systems. Starting without the burden 
of legacy code, a holistic performance engineering process could be deployed 
across the traditional software layers to identify efficient implementations and guide 
sustainable software development. At the basic building blocks level, a flexible 
MPI+X programming approach was implemented together with a new sparse data 
structure (SELL-C-c) to support heterogeneous architectures by design. Further- 
more, ESSEX focused on hardware-efficient kernels for all relevant architectures 
and efficient data structures for block vector formulations of the eigensolvers. 
The algorithm layer addressed standard, generalized, and nonlinear eigenvalue 
problems and provided some widely usable solver implementations including a 
block Jacobi—Davidson algorithm, contour-based integration schemes, and filter 
polynomial approaches. Adding to the highly efficient kernel implementations, 
algorithmic advances such as adaptive precision, optimized filtering coefficients, 
and preconditioning have further improved time to solution. These developments 
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were guided by quantum physics applications, especially from the field of topo- 
logical insulator- or graphene-based systems. For these, ScaMaC, a scalable matrix 
generation framework for a broad set of quantum physics problems, was developed. 
As the central software core of ESSEX, the PHIST library for sparse systems 
of linear equations and eigenvalue problems has been established. It abstracts 
algorithmic developments from low-level optimization. Finally, central ESSEX 
software components and solvers have demonstrated scalability and hardware 
efficiency on up to 256 K cores using million-way process/thread-level parallelism. 


1 Introduction 


The efficient solution of linear systems or eigenvalue problems involving large 
sparse matrices has been an active research field in parallel and high performance 
computing for many decades. Software packages like Trilinos [33] or PETSc [9] 
have been developed to great maturity, and algorithmic improvements were accom- 
panied by advances in programming abstractions addressing, e.g., node-level 
heterogeneity (cf. Kokkos [19]). Completely new developments such as Ginkgo! 
are rare and do not focus on large-scale applications or node-level efficiency. 

Despite projections from the late 2000s, hardware architectures have not devel- 
oped away from traditional clustered multicore systems. However, a clear trend of 
increased node-level parallelism and heterogeneity has been observed. Although 
several new architectures entered the field (and some vanished again), the basic 
concepts of core-level code execution and data parallelism have not changed. This 
is why the MPI+X concept is still a viable response to the challenge of hardware 
diversity. 

Performance analysis of highly parallel code typically concentrated on scalabil- 
ity, but provably optimal node-level performance was rarely an issue. Moreover, 
strong abstraction boundaries between linear algebra building blocks, solvers, and 
applications made it hard to get a holistic view on a minimization of time to solution, 
encompassing optimizations in the algorithmic and implementation dimensions. 

In this setting, the ESSEX project took the opportunity to start from a clean 
slate, deliberately breaking said abstraction boundaries to investigate performance 
bottlenecks together with algorithmic improvements from the core to the highly 
parallel level. Driven by the targeted application fields, bespoke solutions were 
developed for selected algorithms and applications. The experience gained in the 
development process will lead the way towards more generic approaches rather 
than compete with established libraries in terms of generality. The overarching motif 
was a consistent performance engineering process that coordinated all performance- 
relevant activities across the different software layers [1, 3, 4, 20, 52-54, 56, 62, 64]. 


'https://github.com/ginkgo-project/ginkgo. 
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Consequently, the ESSEX parallel building blocks layer implemented in the 
GHOST library [55] supports MPI+X, with X being a combination of node- 
level programming models able to fully exploit hardware heterogeneity, functional 
parallelism, and data parallelism. Despite fluctuations in hardware architectures and 
new programming models hitting the market every year, OpenMP or CUDA is still 
the most promising and probably most sustainable choice for X, and ESSEX-IIT 
adhered to it. In addition, engineering highly specialized kernels including sparse- 
matrix multiple-vector operations and appropriate data structures for all relevant 
compute architectures provided the foundation for hardware- and energy-efficient 
large-scale computations. 

Building on these high-performance building blocks, one focus of the algorithm 
layer was put on the block formulation of Jacobi—Davidson [64] and filter diago- 
nalization [56] methods, the hardware efficiency of preconditioners [46—48], and 
the development of hardware-aware coloring schemes [1]. In terms of scalability, 
the project has investigated new contour-based integration eigensolvers [23, 24] 
that can exploit additional parallelism layers beyond the usual data parallelism. 
The solvers developed in ESSEX can tackle standard, generalized, and nonlinear 
eigenvalue problems and may also be used to extract large bulks of extremal and 
inner eigenvalues. 

The applications layer applies the algorithms and building blocks to deliver 
scalable solutions for topical quantum materials like graphene, topological insu- 
lators, or superconductors, and nonlinear dynamical systems like reaction-diffusion 
systems. A key issue for large-scale simulations is the scalable (in terms of size and 
parallelism) generation of the sparse matrix representing the model Hamiltonian. 
Our matrix generation framework ScaMaC can be integrated into application code to 
allow the on-the-fly, in-place construction of the sparse matrix. Beyond the ESSEX 
application fields, matrices from many other relevant areas can be produced by 
ScaMaC. 

The PHIST library [78] is the sustainable outcome of the performance-centric 
efforts in ESSEX. It is built on a rigorous software and performance engineering 
process, comprises diverse solver components, and supports multiple backends (e.g., 
Trilinos, PETSc, ESSEX kernels). It also interfaces to multiple languages such as 
C, C++, Fortran 2003, and Python. The CRAFT library [73] provides user-friendly 
access to fault tolerance via checkpoint/restart and automatic recovery for iterative 
codes using standard C++. 

Scalabilty, performance, and portability have been tested on three top-10 super- 
computers covering the full range of architecures available during the ESSEX 
project time frame: Piz Daint? (heterogeneous CPU-GPU), OakForest-PACS* 
(many-core), and SuperMUC-N G* (standard multi-core). 


*https://www.cscs.ch/computers/piz-daint/. 
3https://www.cc.u-tokyo.ac.jp/en/supercomputer/ofp/service/. 
+https://doku.lrz.de/display/PUBLIC/SuperMUC-NG. 
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This review focuses on important developments in ESSEX-II. After presenting a 
brief overview of the most relevant achievements in the first project phase ESSEX-I 
in Sects. 2 and 3 details algorithmic developments in ESSEX-II, notably with respect 
to preconditioners and projection-based methods for obtaining inner eigenvalues. 
Moreover, we present the RACE (Recursive Algebraic Coloring Engine) method, 
which delivers hardware-efficient graph colorings for parallelization of algorithms 
and kernels with data dependencies. In Sect.4 we showcase performance and 
parallel efficiency numbers for library components developed in ESSEX-II that are 
of paramount importance for the application work packages: GPGPU-based tall and 
skinny matrix-matrix multiplication and the computation of inner eigenvalues using 
polynomial filter techniques. Section 5 describes the software packages that were 
developed to a usable and sustainable state, together with their areas of applicability. 
In Sect. 6 we show application results from the important areas of quantum physics 
and nonlinear dynamical systems. Finally, in Sect. 7 we highlight the collaborations 
sparked and supported by SPPEXA through the ESSEX-II project. 


2 Summary of the ESSEX-I Software Structure 


The Exascale-enabled Sparse Solver Repository (ESSR) was developed along the 
requirements of the algorithms and applications under investigation in ESSEX. It 
was not intended as a full-fledged replacement of existing libraries like Trilinos> 
[33], but rather as a toolbox that can supply developers with blueprints as a starting 
point for their own developments. In ESSEX-I, the foundations for a sustainable 
software framework were laid. See Sect. 5 for developments in ESSEX-II. 

The initial version of the ESSR [77] comprised four components: 


e GHOST (General, Hybrid and Optimized Sparse Toolkit) [55], a library of basic 
sparse and dense linear algebra building blocks that are not available in this form 
in other software packages. The development of GHOST was strictly guided by 
performance engineering techniques; implementations of standard kernels such as 
sparse matrix-vector multiplication (SpMVM) and sparse matrix-multiple-vector 
multiplication (SpMMVM) as well as tailor-made fused kernels, for instance 
those employed in the Kernel Polynomial Method (KPM) [81], were modeled 
using the roofline model. GHOST supports, by design, strongly heterogeneous 
environments using the MPI+X approach. See [51] for a comprehensive overview 
of GHOST and its building blocks. 

ESSEX-Physics, a collection of prototype implementations of polynomial eigen- 
solvers such as the KPM and Chebyshev Filter Diagonalization (ChebFD). These 
were implemented on top of GHOST using tailored kernels and were shown to 
perform well on heterogeneous CPU-GPU systems [53]. 


Shttps://trilinos.org/. 
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e PHIST (Pipelined Hybrid-parallel Iterative Solver Toolkit), which comprises 
Jacobi—Davidson type eigensolvers and Krylov methods for linear systems. One 
important component is a test framework that allows for continuous integration 
(CI) throughout the development cycle. PHIST can not only use plain GHOST as 
its basic linear algebra layer; it is also equipped with fallback kernel implemen- 
tations and adapters for the Trilinos and Anasazi libraries. A major achievement 
in the development of PHIST was an efficient block Jacobi—Davidson eigenvalue 
solver, which could be shown to have significant performance advantages over 
nonblocked versions when using optimized building blocks from GHOST [64]. 

e BEAST (Beyond fEAST), which implements innovative projection-based eigen- 
solvers motivated by the contour integration-based FEAST method [23]. The 
ESSEX-I project has contributed to improving FEAST in two ways: by proposing 
techniques for solving or avoiding the linear systems that arise, and by improving 
robustness and performance of the algorithmic scheme. 


A pivotal choice for any sparse algorithm implementation is the sparse matrix 
storage format. In order to avoid data conversion and the need to support mul- 
tiple hardware-specific formats in a single code, we developed the SELL-C- 
o format [52]. It shows competitive performance on the dominating processor 
architectures in HPC: standard multicore server CPUs with short-vector single 
instruction multiple data (SIMD) capabilities, general-purpose graphics processing 
units (GPGPUs), and many-core designs with rather weak cores such as the Intel 
Xeon Phi. SELL-C-o circumvents the performance penalties of matrices with few 
nonzero entries per row on architectures on which SIMD vectorization is a key 
element for performance even with memory-bound workloads. 

In order to convert a sparse matrix to SELL-C-o, its rows are first sorted 
according to their respective numbers of nonzeros. This sorting is performed across 
row blocks of length o. After that, the matrix is cut into row blocks of length C. 
Within each block, rows are padded with zeros to equal length and then stored 
in column-major order. See Fig. 1 for visualizations of SELL-C-o with C = 6 
and ø € {1, 12, 24}. Incidentally, known and popular formats can be recovered as 
corner cases: SELL-1-1 is the well-known CSR storage format and SELL-N-1 is 
ELLPACK. The particular choice of C and o influences the performance of the 
spMVM operation; optimal values are typically matrix- and hardware-dependent. 
However, in practice one can usually find parameters that yield good performance 
across architectures for a particular matrix. A roofline performance model was 
constructed in [52] that sets an upper limit for the spMVM performance for any 
combination of matrix and architecture. This way, “bad” performance is easily 
identified. SELL-C-o was quickly adopted by the community and is in use, in pure 
or adapted form, in many performance-oriented projects [5, 6, 28, 60, 80]. 
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Fig. 1 Variants of the SELL-C-o storage format. Arrows indicate the storage order of matrix 
values and column indices. Image from [52]. (a) SELL-6-1, 8 = 0.51. (b) SELL-6-12, 6 = 0.66. 
(c) SELL-6-24, B = 0.84 


3 Algorithmic Developments 


In this section we describe selected developments within ESSEX-II on the algo- 
rithmic level, in particular preconditioners for the solution of linear systems that 
occur in the eigensolvers, a versatile framework for computing inner eigenvalues, 
and a nonlinear eigensolver. We also cover a systematic comparison of contour- 
based methods. We close the section with the introduction of RACE, which is an 
algorithmic development for graph coloring guided by the constraints of hardware 
efficiency. 


3.1 Preconditioners (ppOpen-SOL) 


Two kinds of solvers have been developed: a preconditioner targeting the ill- 
conditioned large scale problems arising in the BEAST-C method (cf. Sect. 3.2) and 
a multigrid solver targeting problems arising from finite difference discretizations 
of partial differential equations (PDEs). 
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3.1.1 Regularization 


The BEAST-C method leads to a large number of ill-conditioned linear systems with 
complex diagonal shifts [24]. Furthermore, in many of our quantum physics applica- 
tions, the system matrices have small (and sometimes random) diagonal elements. In 
order to apply a classic incomplete Cholesky (IC) factorization preconditioner, we 
used two types of regularization to achieve robustness: a blocking technique (BIC) 
and an additional diagonal shift [47]. Using this approach, we solved a set of 120 
prototypical linear systems from this context (e.g., BEAST-C applied to quantum 
physics applications). Due to the complex shift, the system matrix is symmetric but 
not Hermitian. Hence we use an adaptation of the Conjugate Gradient (CG) method 
for complex symmetric matrices called COCG (conjugate orthogonal conjugate 
gradient [79]). 

The blocking technique is a well-known approach for improving the convergence 
rate. In this study, we apply the technique not only for better convergence but also 
for more robustness. The diagonal entries in the target equations are small. By 
applying the blocking technique, the diagonal blocks to be inverted include larger 
off-diagonal entries. 

The diagonal shifting is a direct measure for transforming the ill-conditioned 
matrices to be more diagonally dominant before performing the incomplete fac- 
torization. On the other hand, this may deteriorate the convergence of the overall 
method. We therefore investigate the best value for the diagonal shifting for our 
applications. 

Figure 2 shows the effect of the regularized IC preconditioner with the COCG 
method. By using the diagonal shifted block IC-COCG (BIC-COCG), we solve all 
target linear systems. 
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Fig. 2 Effect of the regularized IC preconditioner with the COCG method. By using the diagonal 
shifted block IC-COCG (BIC-COCG), we can solve all test problems from our benchmark set 
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3.1.2 Hierarchical Parallel Reordering 


In this section, we present scalability results for the BIC preconditioner parallelized 
by a hierarchical parallel graph coloring algorithm. This approach yields an almost 
constant convergence rate with respect to the number of compute nodes, and good 
parallel performance. 

Node-wise multi-coloring (with domain decomposition between nodes) is widely 
used for parallelizing IC preconditioners on clusters of shared memory CPUs. Such 
“localized” multi-coloring leads to a loss of robustness of the regularized IC-COCG 
method, and the convergence rate decreases at high levels of parallelism. To solve 
this problem, we parallelize the block IC preconditioner for the hybrid-parallel 
cluster system. In addition, we proposed the hierarchical parallelization for the 
multi-coloring algorithms [46]. This versatile scheme allows us to parallelize almost 
any multi-coloring algorithm. 

Figure 3 shows the number of iterations and computational time of the BIC- 
COCG method on the Oakleaf-FX cluster, using up to 4,800 nodes. The benchmark 
matrix is the Hamiltonian of a graphene sheet simulation with more than 500 million 
linear equations, for which interior eigenvalues are of interest [24]. Hierarchical 
parallelization yields almost constant convergence with respect to the number of 
nodes. The computational time with 4,600 nodes is 30 times smaller than with 128 
nodes, amounting to a parallel efficiency of 83.5% if the 128-node case is taken as 
the baseline. 


3.1.3 Multiplicative Schwarz-Type Block Red-Black Gau}-Seidel 
Smoother 


Multigrid methods are among the most useful preconditioners for elliptic PDEs. 
In [45] we proposed a multiplicative Schwarz block red/black Gaufi—Seidel (MS- 
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Fig. 3 Computational time and convergence of BIC-COCG for a graphene benchmark problem 
(strong scaling) 
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Post-smoothing 


1 3 4 5 6 F 8 9 10 
[i Bari 22578 22074 23978 2217 2297) 2387) 24874 2507 2727 
[2 | 22574 20178 2117) 18278 1.9076 1.9978 2087) 21774 22878 236/8 
[a [2027] 20877 18274 18178 1.9876 20878 21278 2.1774 228/8 238/8 
[a [2067 18278 187/4 19078 167/5, 1.7473 182/3 189/4 196/3 203/5 
pre- [6 | 21679 18078 197/6 1.6679 1.7273 17478 1.8474 195/5 2.0079 209/3 
smoothing] 6 | 225/4 19778 1.6873 1.8673 1:7975, 185/3 18473 1.985/3 208/9 216/5 
[e [21074 21576 1.8273 18875 1.9475 20173 20875 21779 22079 200] V 
[o [22274 21576 18273 19675 20275 20873 21775 21773 23273 2417/5 BA 
[10| 23174 23678 20174 20473 2.1075, 2.1874 22473 2.3374 240/3 247/5 


Fig. 4 Computational time and number of iterations of a geometric multigrid solver with the MS- 
BRB-GS(q@) smoother 


fast 


BRB-GS) smoother for geometric multigrid methods. It is a modified version of the 
block red-black Gau8B—Seidel (BRB-GS) smoother that improves convergence rate 
and data locality by applying multiple consecutive Gaufi—Seidel sweeps on each 
block. 

The unknowns are divided into blocks so that the amount of data for processing 
each block fits into the cache, and a Gaufi—Seidel iterations are applied to the block 
per smoother step. The computational cost for the additional iterations is much lower 
than for the first iteration because of data locality. 

Figure 4 shows the effect of the MS-BRB-GS(q@) smoother on a single node of 
the ITO system (Intel Xeon Gold 6154 (Skylake-SP) Cluster at Kyushu University). 
By increasing the number of both pre- and post-smoothing steps, the number of 
iterations is decreased. In the best case, MS-BRB-GS is 1.64 faster than BRB-GS. 


3.2 The BEAST Framework for Interior Definite Generalized 
Eigenproblems 


The BEAST framework targets the solution of interior definite eigenproblems 
AX = BXA, 


i.e., for finding all eigenvectors and eigenvalues of a definite matrix pair (A, B), 
with A and B Hermitian and B additionally positive definite, within a given interval 
[A, 4]. The framework is based on the Rayleigh—Ritz subspace iteration procedure, 
in particular the spectral filtering approach: Arbitrary continuous portions of the 
spectrum may be selected for computation with appropriate filtering functions that 
are applied via an implicit approximate projector to compute a suitable subspace 
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basis. Starting with an initial subspace Y, the following three main steps are repeated 
until a suitable convergence criterion is met: 


Compute a subspace U by approximately projecting Y 

Rayleigh—Ritz extraction: solve the reduced eigenproblem Ay V = By VA, 
where Ay = U" AU, By = U” BU, and let X = UV 

Obtain new Y from X or U 


In the following we highlight some of BEAST’s algorithmic features, skipping 
other topics such as locking converged eigenpairs, adjusting the dimension of the 
subspace, and others. 


3.2.1 Projector Types 


BEAST provides three variants of approximate projectors. First, polynomial approx- 
imation (BEAST-P) using Chebyshev polynomials, which only requires matrix 
vector multiplications but is restricted to standard eigenproblems. Second, Cauchy 
integral-based contour integration (BEAST-C), as in the FEAST method [63]. As 
a third method, an iterative implementation of the Sakurai—Sugiura method [65] is 
available (BEAST-M), which shares algorithmic similarities with FEAST. In the 
following we briefly elaborate on the algorithmic ideas. 


e In BEAST-P, we have U = p(A) - Y with a polynomial p(z) = > ck Tk(z) of 
suitable degree d. Here, Ty denotes the kth Chebyshev polynomial, 


To(z)=1, Tiı(z)=z, T(z) = 2z- Tk-1(z)— Tk-2 (2), k = 2. 


Due to the use of the Tę, this method is also known as Chebyshev filter 
diagonalization. 

In addition to well-known methods for computing the coefficients cz [18, 62], 
BEAST also provides the option of using new, improved coefficients [25]. Their 
computation depends on two parameters, jz and o, and for suitable combinations 
of these, the filtering quality of the polynomial can be improved significantly; 
see Fig.5, which shows the “gain,” i.e., the reduction of the width of those 
à values outside the search interval, for which a damping of corresponding 
eigenvectors by at least a factor 100 cannot be guaranteed. For some combinations 
(o, u), marked red in the picture, this “no guarantee” area can be reduced by 
a factor of more than 2, which in turn allows using lower-degree polynomials 
to achieve comparable overall convergence. A parallelized method for finding 
suitable parameter combinations and computing the cx is included with BEAST. 
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e In BEAST-C, the exact projection 


1 
fe (zB — A) |BY 
2ri Jr 


(integration is over a contour T in the complex plane that encloses the eigenvalues 
à € [A, A], but no others) is approximated using an N-point quadrature rule, 


N 
U =} aj(zjB — A) 'BY, 
j=l 


leading to N linear systems, where the number of right-hand sides (RHS) 
corresponds to the dimension of the current subspace U (and Y). 

e BEAST-M is also based on contour integration, but moments are used to reduce 
the number of RHS in the linear systems. Taking M moments, we have 


N 
U =[Uo,...,Um—1] with Uy = >> @jzi(zjB — A)'BY, 
j=l 


and thus an M times smaller number of RHS (dimension of Y) is sufficient to 
achieve the same dimension of U. 


The linear systems in the contour-based schemes may be ill-conditioned if the 
integration points z; are close to the spectrum (this happens, e.g., for narrow search 
intervals [A, A]); cf. also Sects. 3.1 and 3.4 for approaches to address this issue. 


3.2.2 Flexibility, Adaptivity and Auto-Tuning 


The BEAST framework provides flexibility at the algorithmic, parameter, and 
working precision levels, which we describe in detail in the following. 


Algorithmic Level 


The projector can be chosen from the three types described above, and the type 
may even be changed between iterations. In particular, an innovative subspace- 
iterative version of Sakurai—Sugiura methods (SSM) has been investigated for 
possible cost savings in the solution of linear systems via a limited subspace size 
and the overall reduction of number of right hand sides over iterations by using 
moments. Given, however, the potentially reduced convergence threshold with a 
constrained subspace size, we support switching from the multi-moment method, 
BEAST-M, to a single-moment method, BEAST-C. The efficiency, robustness, and 
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Fig. 5 Base-2 log of the “gain” from using modified coefficients with parameters (ø, 4) for the 
interval [A, A] = [—0.584, —0.560] (matrix scaled such that spec(A) = [—1,+1]) and degree 
d = 1600 


accuracy of this approach in comparison with traditional SSM and FEAST has been 
explored [35]. 

We further studied this scheme along with another performance-based imple- 
mentation of SSM, z-PARES [65, 67]. These investigations considered the scaling 
and computational cost of the libraries as well as heuristics for parameter choice, 
in particular with respect to the number of quadrature nodes. We observed that the 
scaling behavior improved when the number of quadrature nodes increased, as seen 
in Fig.6. As the linear systems solved at each quadrature node are independent 
and the quality of numerical integration improves with increased quadrature degree, 
exploiting this property makes sense, particularly within the context of exascale 
computations. However, it is a slightly surprising result, as previous experiments 
with FEAST showed diminishing returns for convergence with increased quadrature 
degree [23], something we do not observe here. 


Parameter Level 


In addition to the projector type, several algorithmic parameters determine the 
efficiency of the overall method, most notably the dimension of the subspace and the 
degree of the polynomial (BEAST-P) or the number of integration nodes (BEAST-C 
and BEAST-M). 

With certain assumptions on the overall distribution of the eigenvalues, clear 
recommendations for optimum subspace size (as a multiple of the number of 
expected eigenvalues) and the degree can be given, in the sense that overall work 
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Fig. 6 Strong scaling of BEAST and z-Pares for a 1M x 1M standard eigenproblem based on 
a graphene sheet of dimension 2000 x 500. Both solvers found 260 eigenpairs in the interval 
[—0.01, 0.01] to a tolerance of 1 x 1078. Both methods used 4 moments and began with random 
initial block vector Y. For BEAST, Y contained 100 columns; for z-Pares, 130. Testing performed 
on the Emmy HPC cluster at RRZE. MUMPS was used for the direct solution of all linear systems. 
N refers to the number of quadrature nodes along a circular contour, NP to the number of 
processes 


is minimized. For more details, together with a description of a performance-tuned 
kernel for the evaluation of p(A) - Y, the reader is referred to [62]. 

If such information is not available, or for the contour integration-type projectors, 
a heuristic has been developed that automatically adjusts the degree (or number of 
integration nodes) during successive iterations in order to achieve a damping of 
the unwanted components by a factor of 100 per iteration, which leads to close-to- 
optimum overall effort; cf. [26]. 


Working Precision Level 


Given the iterative nature of BEAST, with one iteration being comparatively 
expensive, the possibility to reduce the cost of at least some of these iterations is 
attractive. We have observed that before a given residual tolerance is surpassed, 
systematic errors in the computation of the projector and other operations do not 
impair convergence speed per se, but impose a limit on what residual can be 
reached before progress stagnates. One such systematic error is the finite accuracy 
of floating-point computations, which typically are available in single and double 
precision. In the light of the aforementioned behavior, it seems natural to perform 
initial iterations in single precision and thereby save on computation time before a 
switch to double precision becomes inevitable; cf. Fig. 7. 
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Fig. 7 Left: average residual over the BEAST iterations for using double or single precision 
throughout, and for switching from single to double precision in the 7th, 9th, or 11th iteration, 
respectively. Right: time (in seconds) to convergence for a size 1,048,576 topological insulator 
(complex values) with a search space size of 256 and a polynomial degree of 135 on 8 nodes of 
the Emmy-cluster at RRZE. Convergence is reached after identical numbers of iterations (with 
the exception of pure single precision, of course). The timings can vary for different ratios of 
polynomial degree and search space size and depend on the single precision performance of the 
underlying libraries 


Therefore, mixed precision has been implemented in all BEAST schemes 
mentioned above, allowing an adaptive strategy to automatically switch from single 
to double precision after a given residual tolerance is reached. A comprehensive 
description and results are presented in [4]. These results and our initial investi- 
gations also suggest that increased precision beyond double precision (i.e., quad 
precision) will have no benefit for the convergence rate until a certain double 
precision specific threshold is reached; convergence beyond this point would require 
all operations to be carried out with increased precision. 


3.2.3 Levels of Parallelism 


The BEAST framework exploits multiple levels of parallelism using an MPI+X par- 
adigm. We rely on the GHOST and PHIST libraries for efficient sparse matrix/dense 
vector storage and computation; cf. Sect. 5. The operations implemented therein are 
themselves hybrid parallel and constitute the lowest level of parallelism in BEAST. 
Additional levels are addressed by parallelizing over blocks of vectors in Y and, 
for BEAST-C and BEAST-M, over integration nodes during the application of the 
approximate projector. A final level is added by exploiting the ability of the method 
to subdivide the search interval [A, A] and to process the subintervals independently 
and in parallel. Making use of these properties, however, may lead to non-orthogonal 
eigenvectors, which necessitates postprocessing as explained in the following. 
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3.2.4 A Posteriori Cross-Interval Orthogonalization 


Rayleigh—Ritz-based subspace iteration algorithms naturally produce a B- 
orthogonal set of eigenvectors X, i.e., orth(X) is small, where 


(y, x) 
Ixy 


orth(X) = max {orth(x;, xj)li # j} with orth(x, y) = 


By contrast, the orthogonality 
orth(X, Y) = max {orth(x;, ys} 


between two or more independently computed sets of eigenvectors may suffer if 
the distance between the involved eigenvalues is small [49, 50]. Simultaneous re- 
orthogonalization of evolving approximate eigenvectors during subspace iteration 
has proven ineffective unless the vectors have advanced reasonably far. A large scale 
re-orthogonalization of finished eigenvector blocks, on the other hand, requires a 
careful choice of methodology in order to not diminish the quality of the previously 
established residual. 

Orthogonalization of multiple vector blocks implies Gram-Schmidt style propa- 
gation of orthogonality, assuming orth(X, Y) can be arbitrarily poor. In practice, the 
independently computed eigenvectors will exhibit multiple grades of orthogonality, 
but rarely will there be no orthogonality (in the sense above) at all. This, in turn, 
allows for the use of less strict orthogonalization methods. While, in theory, the 
orthogonalization of p blocks requires at least p(p — 1)/2 + (p — 1) block-block or 
intra-block orthogonalizations and ensures global orthogonality, an iterative scheme 
allows for more educated choices on the ordering of orthogonalizations in order 
to reduce losses in residual and improve the communication pattern, eliminating 
the need for broadcasts of vector blocks at the cost of additional orthogonalization 
operations in the form of multiple sweeps. In practice, very few sweeps (~2) are 
sufficient in most cases. 

Every block-block orthogonalization X = X — Y(Y4BX) disturbs the 
orthogonality orth(X) of the modified block, as well as its residual. Local 
re-orthogonalization of X disturbs the residual further. We have identified 
orthogonalization patterns and selected orthogonalization algorithms that reduce 
the loss of residual accuracy to a degree that essentially eliminates the need for 
additional post-iteration. 

The implementation of an all-to-all interaction of many participating vector 
blocks can be performed in multiple ways with different requirements regarding 
storage, communication, runtime, and with different implications on accuracy and 
loss of residual. Among several such strategies and algorithms that have been 
implemented and tested, the most promising is a purely iterative scheme, both 
for global and local orthogonalization operations. It is based on a comparison of 
interval properties, most notably the achieved residual from the subspace iteration. 
We are continuing to explore the possibility to detect certain orthogonalizations as 
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unnecessary without computing the associated inner products in order to further 
reduce the workload without sacrificing orthogonality. 


3.2.5 Robustness and Resilience 


In the advent of large scale HPC clusters, hardware faults, both detectable an 
undetectable, have to be expected. 

Detectable hardware faults, e.g., the outage of a component that violently halts 
execution, can typically only be mitigated by frequent on-the-fly storage of the most 
vital information. In the case of subspace iteration, as is used in BEAST, almost 
all required information for being able to resume computation is encoded in the 
iterated subspace basis in form of the approximate eigenvectors, besides runtime 
information about the general program flow. Relying on the CRAFT library [73], a 
per-iteration checkpointing mechanism has been implemented in BEAST. 

Additionally, for also being able to react to “silent” computation errors that 
merely distort the results but do not halt execution, the most expensive operation 
(application of the approximate projector) has been augmented to monitor the sanity 
of the results. This can be done in two ways: A checksum-style entrainment of 
additional vectors, linear combinations of the right-hand sides, can be checked 
during and after the application of the projector to detect errors and allow for 
the re-computation of the incorrect parts. The comparison of approximate filter 
values obtained from the computed basis and the expected values obtained from 
the scalar representation of the filter function, on the other hand, gives an additional 
a posteriori test for the overall plausibility of the basis. 

Practical tests have shown that small distortions of the subspace basis have not 
enough impact on the overall process in order to justify expensive measures. If the 
error is not recurring, just continuing the subspace iteration is often the best and 
most cost-efficient option. This is particularly true in early iterations, where small 
errors have no effect at all. 


3.3 Further Progress on Contour Integral-Based Eigensolvers 
3.3.1 Relationship Among Contour Integral-Based Eigensolvers 
The complex moment-based eigensolvers such as the Sakurai—Sugiura method can 


be regarded as projection methods using a subspace constructed by the contour 
integral 
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Fig. 8 A map of the relationships among the contour integral-based eigensolvers 


The property of the subspace is well analyzed by using a filter function 
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which approximates a band-pass filter for the target region where the wanted 
eigenvalues are located. Using the filter function, error analyses of the complex 
moment-based eigensolvers were shown in [30, 41, 42, 66, 76]. By using the results 
of the error analyses, an error resilience technique and an accuracy deterioration 
technique have also been given in [32, 43]. 

The relationship between typical complex moment-based eigensolvers was also 
analyzed in [42] focusing on the subspace. The block SS-RR method [36] and the 
FEAST algorithm [76] are projection methods for solving the target generalized 
eigenvalue problem, whereas the block SS-Hankel method [37], Beyn [12], the 
block SS-Arnoldi methods [40] and its improvements [38] are projection methods 
for solving an implicitly constructed standard eigenvalue problem; see [42] for 
details. Figure 8 shows a map of the relationships among the contour integral-based 
eigensolvers. 


3.3.2 Extension to Nonlinear Eigenvalue Problems 


The complex moment-based eigensolvers were extended to nonlinear eigenvalue 
problems (NEPs): 


T(Aj)xi =0, x; € CŒ \ {0}, A; EN CC, 
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where the matrix-valued function T : Q —> C”*" is holomorphic in an open domain 
Q. The projection for a nonlinear matrix function T (A) is given by 


1 k -1 
— | dzz T(z) Y. 
201i Jr 


This projection is approximated by 


N 
Ur = > @jziT(zj)'Y, k=0,1,...,m—1. 
j=l 


The block SS-Hankel [7, 8], block SS-RR [83], and block SS-CAA methods [39] 
are simple extensions of the GEP solvers. A technique for improving the numerical 
stability of the block SS-RR method for NEP was developed in [15, 16]. 

Beyn proposed a method using Keldysh’s theorem and the singular value 
decomposition [12]. Van Barel and Kravanja proposed an improvement of the Beyn 
method using the canonical polyadic (CP) decomposition [10]. 


3.4 Recursive Algebraic Coloring Engine (RACE) 


The standard approach to solve the ill-conditioned linear systems arising in BEAST- 
C or FEAST is to use direct solvers. However, in [24] it was shown that the 
Kaczmarz iterative solver accelerated by a Conjugate Gradient (CG) method (the 
so-called CGMN solver [29]) is a robust alternative to direct solvers. Standard 
multicoloring (MC) was used in [29] for the parallelization of the CGMN kernels. 
After analyzing the shortcomings of this strategy in view of hardware efficiency, we 
developed in collaboration with the EXASTEEL-II project the Recursive Algebraic 
Coloring Engine (RACE) [1]. It is an alternative to the well-known MC and 
algebraic block multicoloring (ABMC) algorithms [44], which have the problem 
that their matrix reordering can adversely affect data access locality. RACE aims 
at improving data locality, reducing synchronization, and generating sufficient 
parallelism while still retaining simple matrix storage formats such as compressed 
row storage (CRS). We further identified distance-2 coloring of the underlying 
graph as an opportunity for parallelization of the symmetric spMVM (SymmSpMV) 
kernel. 

RACE is a sequential, recursive, level-based algorithm that is applicable to 
general distance-k dependencies. It is currently limited to matrices with symmetric 
structure (undirected graph), but possibly nonsymmetric entries. The algorithm 
comprises four steps: level construction, permutation, distance-k coloring, and load 
balancing. If these steps do not generate sufficient parallelism, recursion on sub- 
graphs can be applied. Using RACE implies a pre-processing and a processing 
phase. In pre-processing, the user supplies the matrix, the kernel requirements 
(e.g., distance-1 or distance-2) and hardware settings (number of threads, affinity 
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strategy). The library generates a permutation and stores the recursive coloring 
information in a level tree. It also creates a pool of pinned threads to be used later. 
In the processing phase, the user provides a sequential kernel function which the 
library executes in parallel as a callback using the thread pool. 

Figure 9 shows the performance of SymmSpMV on a 24-core Intel Xeon 
Skylake CPU for a range of sparse symmetric matrices. In Fig. 9a we compare 
RACE against Intel’s implementation in the MKL library, and with roofline limits 
obtained via bandwidth measurements using array copy and read-only kernels, 
respectively. RACE outperforms MKL by far. In Fig.9b we compare against 
standard multicoloring (MC) and algebraic block multicoloring (ABMC). The 
advantage of RACE is especially pronounced with large matrices, where data traffic 
and locality of access is pivotal. One has to be aware that some algorithms may 
exhibit a change in convergence behavior due to the reordering. This has to be taken 
into account when benchmarking whole program performance instead of kernels. 
Details can be found in [1]. 

In order to show the advantages of RACE in the context of a relevant algorithm, 
we chose FEAST [63] for computing inner eigenvalues. The hot spot of the 
algorithm (more than 95%) is a solver for shifted linear systems (A—o I = b). These 
systems are, however, highly ill-conditioned, posing severe convergence problems 
for most linear iterative solvers. We use the FEAST implementation of Intel 
MKL, which by default employs the PARDISO direct solver [69], but its Reverse 
Communication Interface (RCI) allows us to plug our CGMN implementation 
instead. In the following experiment we find ten inner eigenvalues of a simple 
discrete Laplacian matrix to an accuracy of 1078. Figure 10 shows the measured 
time and memory footprint of the default MKL version (using PARDISO) and the 
CGMN versions parallelized using both RACE and ABMC for different matrix 
sizes. ABMC is a factor of 4x slower than RACE. The time required by the default 
MKL with PARDISO is smaller than with CGMN using RACE for small sizes; 
however, the gap gets smaller as the size grows due to the direct solvers having a 
higher time complexity (here ~ O(n?) compared to iterative methods (~ O(n! 56)). 
Moreover, the direct solver requires more memory, and the memory requirement 
grows much faster (see Fig. 10b) than with CGMN. In our experiment the direct 
solver ran out of memory at problem sizes beyond 1403, while CGMN using RACE 
used less than 10% of space at this point. Thus, CGMN with RACE can solve much 
larger problems compared to direct solvers, which is a major advantage in fields like 
quantum physics. 


4 Hardware Efficiency and Scalability 


In this section we showcase performance and parallel efficiency numbers for 
library components developed in ESSEX-II that are of paramount importance 
for the application work packages: GPGPU-based tall and skinny matrix-matrix 
multiplication and the computation of inner eigenvalues using polynomial filter 
techniques. 
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Fig. 9 SymmSpMV performance of RACE compared to other methods. The roofline model 


[17] and 


ScaMaC (see Sect. 5.5) were used. Note that the matrices are ordered according to increasing 


for SymmSpMV is shown in Fig.9a for reference. Representative matrices from 


number of rows. (One Skylake Platinum 8160 CPU [24 threads]). (a) Performance of RACE 


compared with MKL. (b) Performance of RACE compared to other coloring approaches 
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Fig. 10 Comparison of FEAST with default MKL direct solver and iterative solver CGMN, 
parallelized using RACE. (One Skylake Platinum 8160 CPU [24 threads]). (a) Time to solution. 
(b) Memory requirement 


4.1 Tall and Skinny Matrix-Matrix Multiplication (TSMM) on 
GPGPUs 


Orthogonalization algorithms frequently require the multiplication of matrices that 
are strongly nonsquare. Vendor-supplied optimized BLAS libraries often yield sub- 
optimal performance in this case. “Sub-optimal” is a well-defined term here since 
the multiplication of an M x K matrix A witha K x N matrix B with K > M, N and 
small M, N isamemory-bound operation: At M = N, its computational intensity is 
just M/8 flop/byte. In ESSEX-I, efficient implementations of TSMM on multicore 
CPUs were developed [51]. 

The naive roofline model predicts memory-bound execution for M < 64 on 
a modern Volta-class GPGPU. See Fig. 11 for a comparison of optimal (roofline) 
performance and measured performance for TSMM on an Nvidia Tesla V100 
GPGPU using the cuBLAS library. We have developed an implementation of 
TSMM for GPGPUs [20], investigating various optimization techniques such as 
different thread mappings, overlapping long-latency loads with computation via 
leapfrogging’ and unrolling, options for global reductions, and register tiling. Due 


Shttps://docs.nvidia.com/cuda/cublas (May 2019). 


7Leapfrogging in this context means that memory loads to operands are initiated one loop iteration 
before the data is actually needed, allowing for improved overlap between data transfers and 
computations. 
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Fig. 11 Percentage of roofline predicted performance achieved by cuBLAS for the range M = 
N e [1, 64] on a Tesla V100 with 16 GB of memory. (From [20]) 
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Fig. 12 Best achieved performance for each matrix size with M = N in comparison with the 
roofline limit, cuBLAS and CUTLASS, with K = 273. (From [20]) 


to the large and multi-dimensional parameter space, the kernel code is generated 
using a python script. 

Figure 12 shows a comparison between our best implementations obtained 
via parameter search (labeled “leap frog” and “no leap frog,” respectively) with 
cuBLAS and CUTLASS,® which is a collection of CUDA C++ template abstrac- 
tions for high-performance matrix multiplications. Up to M = N = 36, our 
implementation stays within 95% of the bandwidth limit. Although the performance 


Shttps://github.com/NVIDIA/cutlass (May 2019). 
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levels off at larger M, N, which is due to insufficient memory parallelism, it is still 
significantly better than with cuBLAS or CUTLASS. 


4.2 BEAST Performance and Scalability on Modern Hardware 
4.2.1 Node-Level Performance 


Single-device benchmark tests for BEAST-P were performed on an Intel Knights 
Landing (KNL), an Nvidia Tesla P100, and an Nvidia Tesla V100 accelerator, com- 
paring implementations based on vendor libraries (MKL and cuBLAS/cuSPARSE, 
respectively) with two versions based on GHOST: one with and one without tailored 
fused kernels. The GPGPUs showed performance levels expected from a bandwidth- 
limited code, while on KNL the bottleneck was located in the core (see Fig. 13a). 
Overall, the concept of fused optimized kernels provided speedups of up to 2x 
compared to baseline versions. Details can be found in [56]. 
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Fig. 13 BEAST-P performance for a topological insulator problem of dimensions 128x64x64 
with np = 500 using different implementations on KNL, P100, and V100. Performance of a dual 
Xeon E5-2697v3 node (Haswell) is shown for reference. Note the different y axis scaling of the 
V100 results. (From [56]; for details see therein). (a) KNL. (b) P100. (c) V100 
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4.2.2 Massively Parallel Performance 


Scaling tests for BEAST-P were performed on the “Oakforest-PACS” (OFP) at the 
University of Tokyo, “Piz Daint” at CSCS in Lugano, and on the “SuperMUC- 
NG” (SNG) at Leibniz Supercomputing Centre (LRZ) in Garching.” While the 
OFP nodes comprise Intel “Knights Landing” (KNL) many-core CPUs, SNG has 
CPU-only dual-socket nodes with Intel Skylake-SP, and Piz Daint is equipped with 
single-socket Xeon “Haswell” nodes, each of which has an Nvidia Tesla P100 
accelerator attached. Weak and strong scaling tests were done with topological 
insulator (TI) matrices generated by the ScaMaC library. Flops were calculated for 
the computation of the approximate eigenspace, U, averaged over the four BEAST 
iterations it took to find the 148 eigenvalues in each interval to a tolerance of 
1 x 107!°. The subspace contained 256 columns, and spMMVs were performed 
in blocks of size 32 for best performance. Optimized coefficients [25] were used 
for the Chebyshev polynomial approximation, resulting in a lower overall required 
polynomial degree. Weak and strong scaling results are shown in Fig. 14a through d. 

OFP and SNG show similar weak scaling efficiency due to comparable single- 
node performance and network characteristics. Piz Daint, owing to its superior 
single-node performance of beyond 400 Gflop/s, achieves only 60% of parallel 
efficiency at 2048 nodes. A peculiar observation was made on the CPU-only 
SNG system: Although the code runs fastest with pure OpenMP on a single node 
(223 Gflop/s), scaled performance was observed to be better with one MPI process 
per socket. The ideal scaling and efficiency numbers in Fig. 14a—c use the best value 
on the smallest number of nodes in the set as a reference. The largest matrix on SNG 
had 6.6 x 10° rows. 


5 Scalable and Sustainable Software 


It was a central goal of the ESSEX-II project to consolidate our software efforts and 
provide a library of solvers for sparse eigenvalue problems on extreme-scale HPC 
systems. This section gives an overview of the status of our software, most of which 
is now publicly available under a three-clause BSD license. Many of the efforts 
have been integrated in the PHIST library so that they can easily be used together, 
and we made part of the software available in larger contexts like Spack [27] and 
the extreme-scale scientific software development kit xSDK [11]. The xSDK is an 
effort to define common standards for high-performance, scientific software in terms 
of software engineering and interoperability. 


°Runs on OFP and SNG were made possible during the “Large-scale HPC Challenge” Project on 
OFP and the “Friendly-User Phase” of SNG. 
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Fig. 14 Weak scaling of BEAST-P on OFP, Piz Daint, and SNG, and strong scaling on SNG. 
Dashed lines denote ideal scaling with respect to the smallest number of nodes in the set. (a) Weak 
scaling of BEAST-P on OFP for problems of size 220 (2 nodes) to 23° (2048 nodes, about one 
quarter of the full machine). (b) Weak scaling of BEAST-P on Piz Daint for problems of size 27! 
(1 node) to 23” (2048 nodes). (From [56]). (c) Weak scaling of BEAST-P on SNG for problems of 
size 2?! (1 node) to 1.53 x 232 (3136 nodes, about half of the full machine). (d) Strong scaling of 
BEAST-P on SNG for problems of size 2?8 (crosses) and 230 (triangles) 


The current status of the software developed in the ESSEX-II project is summa- 
rized as follows. 


e BEAST is available via bitbucket,!? and can be compiled either using the PHIST 
kernel interface or the GHOST library directly. The former allows using it with 
any backend supported by PHIST. 

e CRAFT is available stand-alone!! or (in a fixed version) as part of PHIST. 

e ScaMaC is available stand-alone!” or (in a fixed version) as part of PHIST. 


!0https://bitbucket.org/essex/beast/. 
' https://bitbucket.org/essex/craft/. 
-https://bitbucket.org/essex/matrixcollection/. 
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e GHOST is available via bitbucket.!* The functionality which is required to 
provide the PHIST interface can be tested via PHIST. Achieving full (or even 
substantial) test coverage of the GHOST-functionality would require a very large 
number of tests (in addition to what the PHIST interface provides, GHOST allows 
mixing data types, and it uses automatic code generation, which leads to an 
exponentially growing number of possible code paths with every new kernel, 
supported processor and data type). It is, however, possible to create a basic 
GHOST installation via the Spack package manager (since March 2018, commit 
bcde376). 

e PHIST is available via bitbucket!* and Spack (since commit 2e4378b). 
Furthermore, PHIST 1.7.5 is part of xSDK 0.4.0. The version distributed with 
the xSDK is restricted to use the Tpetra kernels to maximize the interoperability 
of the package. 


5.1 PHIST and the Block-ILU 


In ESSEX-I we addressed mostly node-level performance [77] on multi-core CPUs. 
The main publication of ESSEX-II concerning the PHIST library [78] presents 
performance results for the block Jacobi-Davidson QR (BJDQR) solver on various 
platforms, including recent CPUs, many-core processors and GPUs. It was also 
shown in this work that the block variant has a clear performance advantage over 
the single-vector algorithm in the strong scaling limit. The reason is that, while 
the number of matrix-vector multiplications increases with the block size (see 
also [64]), the total number of reductions decreases. In order to demonstrate the 
performance portability of PHIST, we show in Fig. 15 a weak scaling experiment 
on the recent SuperMUC-NG machine. 

For the block size 4, we roughly match the performance it achieves in the 
memory-bounded HPCCG benchmark (207 TFlop/s),!> but using only half of the 
machine. This gives a clear indication that our node-level performance engineering 
and multi-node implementation are highly successful: after all, we do not optimize 
for the specific operator application (a simple structured grid, 3D Laplace operator), 
which the HPCCG code does. On the other hand, we have an increased compu- 
tational intensity for some of the operations due to the blocking, which increases 
the performance over a single-vector CG solver. The single-vector BJDQR solver 
achieves 98 TFlop/s on half of the machine. 


'Shttps://bitbucket.org/essex/ghost/. 
'4https://bitbucket.org/essex/phist. 
15 See https://www.top500.org/system/179566. 
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Fig. 15 Weak scaling behavior of the PHIST BJDQR solver for a symmetric PDE benchmark 
problem and different block sizes 


5.1.1 Integration of the Block-ILU Preconditioning Technique 


Initial steps have been taken to make the Block-ILU preconditioner (cf. Sect. 3.1) 
available via the PHIST preconditioning interface. At the time of writing, there is 
an experimental implementation of a block CRS sparse matrix format in the PHIST 
builtin kernel library, including parallel conversion and matrix-vector product rou- 
tines and the possibility to construct and apply the block Cholesky preconditioner. 
Furthermore, the interfaces necessary to allow using the preconditioner within 
the BJDQR eigensolver have been implemented. These features are available for 
experimenting in a branch of the PHIST git repository because they do not yet meet 
the high demands on maintainability (especially unit testing) and documentation of 
a publicly available library. Integration of the method with the BEAST eigensolver 
is not yet possible because the builtin kernel library does not support complex 
arithmetic. As mentioned in Sect. 3.2, the complex version will be integrated directly 
into the BEAST software, instead. 


5.2 BEAST 


BEAST combines implementations of spectral filtering methods for Rayleigh- 
Ritz type subspace iteration in a generalized framework to provide facilities for 
improving performance and robustness. The algorithmic foundation allows for the 
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solution of interior Hermitian definite eigenproblems of standard and generalized 
form via an iterative eigensolver, unveiling all eigenpairs in one or many specified 
intervals. The software is designed as hybrid parallel library, written in C/C++, 
and relying on GHOST and PHIST to provide basic operations, parallelism, and 
data types. Beyond the excellent scalability of the underlying kernel libraries, 
multiple additional levels of parallelism allow for computing larger portions of the 
spectrum and/or utilizing a larger number of computing cores. The inherent ability 
of the underlying algorithm to compute separate intervals independently offers wide 
potential but requires careful handling of cross-interval interactions to ensure the 
desired quality of results, which is well supported by BEAST. 

The BEAST library interface comes in variations for the common floating point 
formats (real and complex, single and double precision) for standard and gener- 
alized eigenproblems. Additionally, the software offers the possibility to switch 
precisions on-the-fly, from single to double precision, in order to further improve 
performance. While BEAST offers an algorithm for standard eigenproblems that 
completely bypasses the need for linear system solves, other setups typically require 
a suitable linear solver. Besides a builtin parallel sparse direct solver for banded 
systems, BEAST includes interfaces to MUMPS and Strumpack, as well as a 
flexible callback-driven interface for the inclusion of arbitrary linear solvers. It also 
interfaces with CRAFT and ScaMaC, which provide fault tolerance and dynamic 
matrix generation, respectively. While working out of the box for many problems, 
BEAST offers a vast amount of options to tweak the software for the specific 
problem at hand. A builtin command line parser allows for easy modification. The 
included application bundles the several capabilities of BEAST in form of a stand- 
alone tool that reads or generates matrices and solves the specified eigenproblem. 
As such, it acts as comprehensive example for the usage of BEAST. 

The library is still in a development state, and interface and option sets may 
change. A more comprehensive overview over a selection of features is provided in 
Sect. 3.2. 


5.3 CRAFT 


The CRAFT library [73] covers two essential aspects of fault tolerance namely 
communication, and data recovery of an MPI application in case of process-failures. 

In the Checkpoint/Restart part of the library, it provides an easier and extensible 
interface for making application-level checkpoint/restart. A CRAFT checkpoint 
can be defined simply by defining a Checkpoint object and adding the restart- 
relevant data in it, as shown in Listing 1. By default, the Checkpoint: : add () 
function supports the most frequently used data formats, e.g., “plain old data” 
(POD), i.e., int, double, float, etc., POD 1D- and 2D-arrays, MPI data-types, 
etc.. However, it can be easily extended to support any user defined data-types. 
The Checkpoint: :read(),write() and update() methods can then be 
used to read/write all added checkpoint’s data. The library supports asynchronous- 
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| #include <mpi.h> 
2 #include <craft.h> 
3 int main(int argc, charx argv[]){ 


5 size_t n=5, myrank, iteration=1, cpFreq=10; 
6 double dbl = 0.0; 
int * dataArr = new int[n]; 
MPI_Comm FT_Comm; 
9 MPI_Comm_dup (MPLCOMM_WORLD, &FT_Comm) ; 
10 AFT_BEGIN (FT_Comm, &myrank, argv); 
11 2k 26 2 2 k k k 26 2k 2k ək a k 2k 2 2 2 2 k 2k 2k 2 2k 2k k 2 2 2 2k k 2 2k 2k 2 k 2 2 Kk k 2K k kk k 


12 Checkpoint myCP ("myCP", FI_Comm) ; * //define checkpoint 
13 myCP.add ("dbl", &dbl); * 

14 myCP.add ("iteration ", &iteration); * 

15 myCP.add ("dataArr", dataArr, &n); AFT Zone 

16 myCP. commit () ; * 

17 myCP. restartIfNeeded (&iteration); 

18 for(; iteration <= 100 ; iteration++){ 


ES 
x 
19 Computation_communication() ; * 
20 modifyData(&dbl, dataArr); * 
21 myCP.updateAndWrite (iteration , cpFreq); * 
oo 
EJ 
* 


} 


2k 2k k k k kk k k 2K k k kk 2K kk 2K k k kK k 2k k k k 2K 2k 2k 2K 2k 2K k ÞK 2K Kk k k K k k 
AFT END () ; 


Listing 1 A toy-code that demonstrates the simplicity of CRAFT’s checkpoint/restart and 
automatic fault tolerace features in a typical iterative-style scientific application 


checkpointing as well as node-level checkpointing using the SCR library [68]. 
Moreover it supports multi-staged, nested-, and signal-checkpointing. 

The Automatic Fault Tolerance (AFT) part of CRAFT provides an easier 
interface for a dynamic process-failure recovery and management. CRAFT uses 
the ULFM-MPI implementation for process-failure detection, propagation, and 
communication recovery procedures, however it considerably reduces the user’s 
effort by hiding these details behind AFT_BEGIN () and AFT_END () functions 
as shown in Listing 1. After a process failure, the library recovers the broken 
communicator (shrinking or non-shrinking by process-spawning), and returns the 
control back to the program at AFT_BEGIN (), where the data can be recovered. 
Both of these CRAFT functionalities are designed to complement each other, 
however they can be used independently as well. For detailed explanation of the 
features included in CRAFT, check [73]. Moreover, the library is available at [72]. 


5.4 CRAFT Benchmark Application 


Within the scope of ESSEX, we have integrated CRAFT in the GHOST and PHIST 
libraries, and the BEAST algorithm. 

Figure 16 shows a benchmark comparing the overhead of three different check- 
pointing strategies for the Lanczos algorithm (GHOST-based eigensolver), Jacobi— 
Davidson (PHIST-based eigensolver), and the BEAST algorithm. The important 
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Fig. 16 CRAFT checkpointing overhead comparison for the Lanczos, Jacobi—Davidson, and 
BEAST eigenvalue solvers using three checkpointing methods of CRAFT, namely, node-level 
checkpointing with SCR, asynchronous PFS, and synchronous PFS checkpoints. The overhead for 
each checkpoint case is shown as a percentage. (Number of nodes=128, number of processes=256, 
Intel MPI) 


parameters for these benchmarks are listed in Table 1. The benchmark shows that 
the node-level and asynchronous checkpointing significantly reduces the checkpoint 
overhead despite a very high checkpoint frequency. 

The benchmark presented in Fig. 17 demonstrates the overhead caused by 
checkpoint/restart as well as by the communication recovery after process failures 
for the Lanczos application. The first two bars, namely ‘No CP Intel MPI’ and 
‘No CP ULFM-MPI show the runtime between non-fault-tolerant (Intel-MPI) vs. a 
fault-tolerant MPI implementation (ULFM-MPI), and creates a baseline for ULFM- 
MPI implementation without any failures. The next two groups of bars show the 
application runtime with 0-,1-, and 2-failures with checkpoints taken on PFS- and 
node-level. The failures are triggered at the mid-point of two successive checkpoints 
from within the application to have a deterministic re-computation time, where 
each failure simulates a complete node-crash (2 simultaneous process failures) 
and recovery is performed in a non-shrinking fashion on spare nodes. The largest 
contribution to the overhead is caused by the re-computation part, whereas the 
communication repair overhead takes an average of ~ 2.6s only. 

Besides ESSEX, CRAFT has been utilized in [22] to create a process-level fault 
tolerant FEM code based on the shrinking recovery style. Moreover, CRAFT has 
been recently integrated in the EXASTEEL [21] project. 
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Table 1 The parameter values for Lanczos, JD, and Beast benchmarks 


Lanczos parameters 


Matrix Graphene-3000- Number of rows and 9.0 - 108 
3000 columns 


Number of non-zeros 11.7 -10° Global checkpoint size ~14.4 GB 
Number of iterations 3000 Checkpoint frequency 500 


Jacobi-Davidson parameters (using Phist) 


columns 
eigenvalues 
Number of checkpoints [100 


Global checkpoint size ~32 GB Backend support library | GHOST 
Beast parameters 


Matrix tgraphene: 12000, | Number rows and 1.44 - 108 
12000, 0 columns 
Beast iterations 9 Checkpoint frequency 2 


Global checkpoint size ~65 GB Backend support library | GHOST 
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Fig. 17 Lanczos application with various checkpoint/restart and process failure recovery scenarios 
using 128 nodes (256 processes) on the RRZE Emmy cluster. On average the communication 
recovery time is 2.6s (ULFM-MPI v1.1) 


174 C. L. Alappat et al. 


5.5 ScaMaC 


Sparse matrices are central objects in the ESSEX project because of its focus on 
large-scale numerical linear algebra problems. A sparse matrix, whether derived 
from the Hamiltonian of a quantum mechanical system, from the Laplacian in a 
partial differential equation, or simply given as an abstract entity with unknown 
properties, defines a problem to be solved. The solution may then consist of a set 
of eigenvalues and eigenvectors computed with the BEAST or Jacobi—Davidson 
algorithms or, more moderately, of an estimate of some matrix norm or the spectral 
radius. 

Testing and benchmarking of linear algebra algorithms, but also of computational 
kernels such as spMVM, requires matrices of different type and different size. 
Standard collections such as the Matrix Market [58] or Florida Sparse Matrix 
Collection [17] cover a wide range of examples, but mainly provide matrices of 
fixed moderate size. As algorithms and implementations improve, such matrices 
become readily too small and limited to serve as realistic test and benchmark cases. 

We therefore decided in the ESSEX project to establish a collection of scalable 
matrices—the ScaMaC. Every matrix in ScaMaC is parameterized by individual 
parameters that allow the user to scale up the matrix dimension and to modify 
other, for example spectral, properties of the matrix. ScaMaC includes simple 
test and benchmark matrices but also ‘real-world’ matrices from research studies 
and applications. A major goal of ScaMaC is to provide a flexible yet generic 
interface for matrix generation, together with the necessary infrastructure to allow 
for immediate access to the collection irrespective of the concrete usage case. 

The ScaMaC approach to matrix generation is straightforward and simple: 
Matrices are generated row-by-row (or column-by-column). The entire complexity 
of the actual generation technique, which depends on the specific matrix example, 
is encapsulated in a ScamacGenerator type and hidden from the user. ScaMaC 
provides routines to create and destroy such a matrix generator, to query matrix 
parameters prior to the actual matrix generation, and to obtain each row of the 
matrix. The ScaMaC interface is entirely generic and identical for all matrices in 
the collection. 

A minimal code example is given in Fig. 18. In this example, the matrix and 
its parameters are set by parsing an argument string of the form "MatrixName, 
parameter=...,..." in line 3, before all rows are generated in the loop in 
lines 12-17. As this examples shows, parallelization of matrix generation is not 
part of the ScaMaC, but lies within the responsibility of the calling program. All 
ScaMaC routines are thread-safe and can be embedded directly into MPI processes 
and OpenMP threads. This approach guarantees full flexibility for the user and is 
easily integrated into existing parallel matrix frameworks such as PETSc or Trilinos. 
Both BEAST and PHIST provide direct access to the ScaMaC, therefore freeing the 
user from any additional considerations when using ESSEX software. 
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// step 1: obtain a generator - per process 

ScamacGenerator * my_gen; 

err = scamac_parse_argstr("“Hubbard,n sites=20", &my_gen, Serrstr); 
err = scamac_generator_finalize(my gen); 


oR WN 


// step 2: allocate workspace - per thread 
ScamacWorkspace * my_ws; 


a 


x 


8 err = scamac_workspace_alloc(my_ gen, &my ws); 

E E NETET, 

0 // step 3: generate the matrix row by row 

1 ScamacIdx nrow = scamac_generator_query_nrow(my_gen); 

2 for (idx=0; idx<nrow; idx++) { // parallelize loop with OpenMP, MPI, ... 
3 // obtain the column indices and values of one row 

4 err = scamac_generate_row(my_gen, my ws, idx, SCAMAC_ DEFAULT, &nz, cind, val); 
5 // store or process the row 

Ganapati carina asta "a wrincea 

T J 

8 // step 4: clean up 

9 err = scamac_workspace_free(my ws) ; // in each thread 


>i 


err = scamac_generator_destroy(my_gen); // in each process 
// step 5: use matrix 


Nwy 
N = 


Fig. 18 Code example for row-by-row matrix generation with the generic ScaMaC generators 


ScaMaC is written in plain C. Auto-generated code is included already in the 
release, such that requirements at compile time are minimal. Interoperability with 
other programming languages is straightforward, e.g., by using the ISO C bindings 
of the FORTRAN 2003 standard. Runtime requirements are equally minimal. 
Matrix generation has negligible memory overhead, requiring only a few KiB 
workspace to store lookup tables and similar information. 

The key feature of ScaMaC is scalability, since the matrix rows (or columns) can 
be generated independently and in arbitrary order. For example at Oakforest-PACS 
(see Sect. 4.2.2), a Hubbard matrix (see below) with dimension >9 x 10? and 
>1.5 x 10!! non-zeros is generated in less than a minute, using 2!° MPI processes 
each of which generates an average of 1.5 x 10° rows per second. As explained, the 
task of efficiently storing or using the matrix is left to the calling program. 

ScaMaC is accompanied by a small toolkit for exploration of the collection. The 
toolkit addresses some basic tasks such as querying matrix information or plotting 
the sparsity pattern, but is not intended to compete with production-level code 
or full-fledged solver libraries, which the ESSEX project provides with BEAST, 
GHOST, and PHIST. 

At the moment,! the matrix generators included in ScaMaC strongly reflect 
our personal research interests in quantum physics, but the ScaMaC framework is 
entirely flexible and allows for easy inclusion of new types of matrices, provided 
that they can be generated in a scalable way. The next update (scheduled for 2020) 
will extend ScaMaC primarily with matrix generators for standard partial differ- 


!6Tn version 0.8.2, ScaMaC contains 15 different matrix generators with a total of 95 parameters. 
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ential equations, including stencil matrices and finite element discretizations for 
advection-diffusion and elasticity problems, wave propagation, and the Schrédinger 
equation. Additional examples that are well suited for scalable generation are 
regular and irregular graph Laplacians, which have gained renewed interest in the 
context of machine learning [14, 59]. 

To obtain an idea of the ‘real-world’ application matrices already contained in 
ScaMaC, consider two examples: The celebrated Hubbard model of condensed 
matter physics (Hubbard) [34] and a theoretical model for excitons in the cuprous 
oxide from our own research in this field (Exciton) [2]. These matrices appear 
as Hamiltonians in the Schrödinger equation, and thus are either symmetric real 
(Hubbard) or Hermitian complex (Exciton). The respective application requires 
a moderate number (typically, 10-1000) of extremal or interior eigenpairs, which 
is less than 0.1% of the spectrum. Other ScaMaC generators provide general 
(non-symmetric or non-Hermitian) matrices, with a variety of sparsity patterns, 
spectral properties, etc. All generators depend on a number of application-specific 
parameters,'’ which are partly listed in Table 2 for the Hubbard and Exciton 
generator. 

For the Hubbard example, two parameters determine the matrix dimension and 
sparsity pattern: n_fermions gives the number electrons per spin orientation (up or 
down), n_sites the number of orbitals occupied by the electrons. In terms of these 
parameters, the matrix dimension is D = (, ai This dependency results in 
the rapid growth of D shown in Table 3. In the physically very interesting case of 
half-filling (n_fermions = n_sites/2 = n) we have asymptotically D ~ 2”/./Gr/2)n, 
that is, exponential growth of D. 

The Exciton example has the more moderate dependence D = 3(2L + 1)? 
(see Table 3). Here, the parameter L is a geometric cutoff that limits the maximal 
distance between the electron and hole that constitute the exciton. This example has 
a number of other parameters that are adapted literally from [2]. These parameters 
enter into the matrix entries, and thus affect the matrix spectrum and, finally, the 
algorithmic hardness of computing the eigenvalues of interest that determine the 
physical properties of the exciton. 

Both Hubbard and Exciton are examples of difficult matrices, albeit for 
different reasons. For Hubbard, one unresolved challenge is to compute multiple 
interior eigenvalues for large n_fermions, n_sites, which becomes extremely diffi- 
cult because of the rapid growth of the matrix dimension (specialized techniques for 
the Hubbard model such as the density-matrix renormalization group [70] cannot 
compute interior eigenvalues). Due to the irregular sparsity pattern of the Hubbard 
matrices (see Fig. 19 below), already the communication overhead of spMVM 
poses a Serious obstacle to scalability and parallel efficiency. For Exciton, which 
are essentially stencil-like matrices of moderate size, the challenge is to compute 
some hundred eigenvalues out of a strongly clustered spectrum. Here, it is the 


'7For a full list of generators and parameters, consult the ScaMaC documentation included with 
the code, or at https://alvbit.bitbucket.io/scamac_docs/_matrices_page.html. 
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Table 2 Parameters of the Hubbard and Exciton matrix generator in the ScaMaC 


Hubbard Exciton 
matrix type: symmetric real matrix type: Hermitian complex 

int n_sites number of sites int L cube length 

int nfermions number of fermions double so spin orbit 
double t hopping strength double ex exchange 
double U Hubbard interaction double mlh mass light hole 
double mhh mass heavy hole 

double me mass electron 

1 ielectri 
double ranpot random potential I A dielectric constant 
rngseed seed random seed 


Table 3 Matrix dimension D for the Hubbard and Exciton exam- 
ple, as a function of the respective parameter n_sites (and default value 
n_fermions = 5) or L 


Hubbard Exciton 
n_sites D L D 
10 63 504 10 27 783 
15 9 018 009 20 206 763 
20 240 374 016 50 3 090 903 
25 2 822 796 900 100 24 361 803 
30 20 307 960 036 150 81 812 703 
40 432 974 528 064 200 193 443 603 


Fig. 19 Sparsity pattern of the Hubbard (Hubbard,n sites=40,n fermions=20) and 
spin chain (SpinChainxxZ,n_sites=32,n_up=8) example 


poor convergence of iterative eigenvalue solvers for nearly degenerate eigenvalues 
that renders this problem hard. Thanks to the algorithmic advances in the ESSEX 
project, we now have reached a position that allows for future progress on these 
problems. 

ScaMaC comes with several convenient features. For example, the Hubbard 
matrix includes the parameter ranpot to switch on a random potential. Random 
numbers in ScaMaC are entirely reproducible, and independent of the number of 
threads or processes that call the ScaMaC routines, or of the order in which the 
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matrix rows are generated. An identical random seed gives the same matrix under 
all circumstances. In particular, individual matrix rows can be reconstructed at 
any time, which simplifies a fault-tolerant program design (see Sect. 5.3). Another 
feature is the possibility to effortlessly generate the (conjugate) transpose of non- 
symmetric (non-Hermitian) matrices, which is considerably easier than constructing 
the transpose of a (distributed) sparse matrix after generation. 


6 Application Results 


6.1 Eigensolvers in Quantum Physics: Graphene, Topological 
Insulators, and Beyond 


Because of the linearity of the Schrödinger equation, quantum physics is a paradigm 
for numerical linear algebra applications. Historically, some application cases, such 
as the computation of the ground state (i.e. of the eigenvector to the minimal 
eigenvalue), have received so much attention that only gradual progress remains 
possible nowadays. In the ESSEX project we instead address two major cases 
where novel algorithmic improvements and systematic utilization of large-scale 
computing resources through state-of-the-art implementations still result in sub- 
stantial qualitative progress. These two cases are the computation of (i) extreme 
eigenvalues with high degeneracy, which is addressed with a block Jacobi—Davidson 
algorithm, (ii) multiple interior eigenvalues, which is addressed by various filter 
diagonalization techniques. Application case (i) has been documented in [64], 
including the example of spin chain matrices (SpinChainXXZ in the ScaMaC). 
For application case (ii) the primary quantum physics example are graphene [13] 
and topological insulators [31] (Graphene and TopIns in the ScaMaC). For these 
examples, eigenvalues towards the center of the spectrum, near the Fermi energy of 
the material, are those of interest. This situation is similar to applications in quantum 
chemistry and density functional theory, but in our case the matrices represent a full 
(disordered or structured) two or three-dimensional domain, and are usually larger 
than those considered elsewhere [57]. 

Starting with the paper [62] on Chebyshev filter diagonalization (ChebFD) and 
culminating in the BEAST software package (see Sect.3.2), the computation of 
interior eigenvalues of large-to-huge graphene and topological insulator matrices 
has been successfully demonstrated with ESSEX algorithms, using polynomial 
filters derived from Chebyshev polynomials. Already with the simple ChebFD 
algorithm we could compute Nr œ 100 eigenvectors from the center of the 
spectrum of a matrix with dimension D ~ 10° (i.e. an effective problem size 
Nr x D ~ 10!'), in order to understand the electronic properties of a structured 
topological insulator (see Figure 13 in [62]). With improved filter coefficients and 
a more sophisticated implementation, the polynomial filters in the (P-) BEAST 
package deal with such problems at reduced computational cost (see Sect. 3.2). 


ESSEX: Equipping Sparse Solvers For Exascale 179 


Such large-scale computations heavily rely on the optimized ssMMVM and TSMM 
kernels of the GHOST library (see Sect. 5). 

To appreciate the numerical progress reflected in these numbers one should 
note the different scaling of the numerical effort Nmym (measured in terms of 
the dominant operation of spMVM) for the computation of extreme and interior 
eigenvalues (cf. the discussion in [62]). In an idealized situation with equidistant 
eigenvalues, we have roughly Nmvm ~ D'/2 for extreme but Nmym ~ D for 
interior eigenvalues. For the D ~ 10° example, we have to compensate for a factor 
104-105 to enable computation of interior instead of extreme eigenvalues. 

Algorithm and software development in ESSEX has been to a large degree 
application-driven. Now, at the end of the ESSEX project, where the algorithms 
for our main application cases have become available, we follow two ways to go 
beyond the initial quantum physics applications. First, entirely new applications 
can now be addressed with ESSEX software, extending our efforts to non-linear 
and non-Hermitian problems (see Sect.6.2). Second, relevant applications such 
as the Hubbard and Exciton examples (see Sect.5.5) still fit into the two 
major application areas already addressed in ESSEX, but further increase the 
computational complexity. For Exciton, the strongly clustered spectrum with 
many nearly-degenerate eigenvalues leads to a numerical effort Nuym >> D! 
already for extreme eigenvalues. For Hubbard, the huge matrix dimension D is a 
serious obstacle for the computation of interior eigenvalues. 

The Hubbard matrices also hint at an application-specific issue of general 
interest that we encountered but could not solve within ESSEX. Specifically, it is 
the complicated sparsity pattern of many of our quantum physics matrices (see 
Fig. 19) that adversely affects the parallel efficiency of distributed spMVM, and thus 
of our entire software solutions. Node-level performance engineering is here easily 
overcompensated by communication overhead. Unfortunately, the communication 
overhead is not reduced by standard matrix reordering strategies [61, 71, 84]. 
This problem can be partially alleviated by overlapping communication with 
computation, as in the spMMVM (see Sect.2), but a full solution to restore 
parallel efficiency is not yet available. Clearly, our different application scenarios 
still provide enough incentive to think about future numerical, algorithmic, and 
computational developments beyond the ESSEX project. 


6.2 New Applications in Nonlinear Dynamical Systems 


The block Jacobi—Davidson QR eigensolver in PHIST is capable of solving non- 
symmetric and generalized eigenvalue problems of the form 


Ax = ABx, (1) 


where B should be symmetric and positive definite. In [75], we exploited sev- 
eral unique features of this implementation to study the linear stability of a 
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three-dimensional reaction-diffusion equation: the Jacobian is non-symmetric, the 
preconditioner was implemented in Epetra (which can be used directly as a backend 
for PHIST), and the high degree of symmetry in the model yields eigenvalues with 
high geometric multiplicity (up to 24). We therefore use a relatively large block size 
of 8 for these computations to achieve convergence to the desired 20-50 eigenpairs 
(ài, Xi), with the real part of 4; near 0. In a recent Ph.D. thesis [74], the solver was 
also used for studying the linear stability of incompressible flow problems. Here B is 
in fact only semi-definite, and the preconditioner has to make sure that the solution 
stays in the ‘divergence-free space’, in which the velocity field satisfies V - u = 0 
and B induces a norm. 

Another ongoing effort concerning dynamical systems is the use of PHIST to 
parallelize the dynamical systems analysis tool PyNCT, which has as its main 
application the study of superconductors [82]. We have taken first steps to use 
PHIST as backend for the Python-based algorithms in PyNCT. Furthermore, it 
is possible to solve the eigenvalue problems arising in PyNCT directly by the 
BJDQR method in PHIST. Our goal here is the scalable parallel and fully automatic 
computation of bifurcation diagrams using PyNCT and any backend supported by 
PHIST. 

The Statistical Learning Lab led by Dr. Marina Meila at the University of 
Washington started to use the PHIST eigensolver to compute spectral gaps for 
Laplacian matrices obtained from conformation trajectories in molecular dynamics 
simulations, and other scientific data [14, 59]. These are symmetric positive definite 
matrices whose dimensions equal the number of simulation steps, typically of the 
order of n = 10°. When the data intrinsic dimension d is fixed, and much smaller 
than n (in our examples d < 10 ), the Laplacian is a sparse matrix. The sparsity 
pattern is not regular, and it is data dependent, as it reflects the neighborhood 
relationships in the data. Hence, in densely sampled regions rows will have many 
more non-zeros than in the sparsely populated regions of the data. In a manifold 
embedding algorithm, the eigengaps identify the optimal number of coordinates in 
which to embed the data. Furthermore, for data sizes n >> 10°, PHIST is used to 
compute the diffusion map embedding itself for the higher frequency coordinates 
for which existing methods are prohibitively slow. 


7 International Collaborations 


The internationalisation effort in the second phase of SPPEXA has fostered the 
ESSEX-II activities in several directions. First and foremost it amplified the sci- 
entific expertise in the project. Soon it became clear that complementing knowledge 
and developments could be leveraged across the partners. A specific benefit of 
the collaboration between German and Japanese partners is their very different 
background in terms of HPC infrastructures.Through close personal collaboration 
within the project all partners could easily access and use latest supercomputers on 
either side (see Sect. 4.2 and note that the BEAST framework has also been ported 
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to the K-computer). Together with the joint collaboration on scientific problems and 
software development a steady exchange evolved with many personal research visits 
which also opened up collaboration with partners not involved directly in ESSEX-IL. 

The results described in Sect.3.1 on preconditioners are a direct result of the 
collaboration of ESSEX-II with the ppOpen-HPC!® project led by Univ. of Tokyo. 
On the other hand, the CRAFT library developed at Univ. of Erlangen is utilized in 
an FEM code of Univ. of Tokyo and is part of a follow on JHPCN project with the 
German partner involved as associated partner. 

Collaboration between Japanese and German working groups made possible the 
expansion of the BEAST framework for projection based eigensolvers to include 
Sakurai—Sugiura methods. Various numerical and theoretical issues associated with 
the implementation of the solver within an iterative framework were resolved, and 
new ideas explored during research visits. Results based on this collaboration have 
so far been presented in multiple conferences and a paper in preparation [35]. 

The linear systems arising from numerical quadrature in the BEAST-C and 
BEAST-M framework were used in the testing and development of a Block 
Cholesky-based ILU preconditioner. The integration of an interface to this solver 
into BEAST has begun. Examining strategies and expectations for solving these 
extreme ill-conditioned problems was a point of intense discussion and collab- 
oration between working groups. One results was the development of RACE 
(see Sect.3.4). Beyond the discussion between several Japanese and German 
ESSEX-II partners also a strong collaboration with the Swiss partner (O. Schenk) of 
EXASTEEL-II evolved, who is an expert on direct solvers and graph partitioning. 
In this context also a collaboration with T. Iwashita (Hokkaido Univ., Japan) started 
in terms of hardware efficient coloring. 

Throughout the project, the variety of large matrices continuously added to 
the ScaMaC library allowed for testing with a variety of realistically challenging 
problems of both real and complex types in all ESSEX-II working groups. 
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Abstract This text presents contributions to efficient high-order finite element 
solvers in the context of the project ExaDG, part of the DFG priority program 1648 
Software for Exascale Computing (SPPEXA). The main algorithmic components 
are the matrix-free evaluation of finite element and discontinuous Galerkin operators 
with sum factorization to reach a high node-level performance and parallel scalabil- 
ity, a massively parallel multigrid framework, and efficient multigrid smoothers. 
The algorithms have been applied in a computational fluid dynamics context. The 
software contributions of the project have led to a speedup by a factor 3 — 4 
depending on the hardware. Our implementations are available via the deal.II finite 
element library. 
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1 Introduction 


Exa-scale performance of numerical algorithms is determined by two factors, node- 
level performance and distributed-memory scalability to thousands of nodes over an 
Infiniband-type fabric. Additionally, the final application efficiency in terms of time- 
to-solution is strongly influenced by the choice of numerical methods, where a high 
sequential efficiency is essential. The project ExaDG aims to bring together these 
three pillars to create an algorithmic framework for the next generation of solvers 
for partial differential equations (PDEs). The guiding principles of the project are as 
follows: 


Efficient discretization Efficient solvers 
few spatial unknowns, few iterations, 1.e., few 
few time steps operator evaluations 
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Efficient implementation: 
fast operator evaluation 


If we define the overall goal to be a minimum of computational cost to reach 
a predefined accuracy, this aim can be split into three components, namely the 
efficiency of the discretization in terms of the number of degrees of freedom (DoFs) 
and time steps, the efficiency of the solvers in terms of iteration counts, and the 
efficiency of the implementation [22]: 


accuracy 


computational cost 


accuracy 1 DoFs - timesteps - iterations 
~~ DoFs - timesteps iterations computational cost 
Po ees 
discretization  Solvers/preconditioners implementation 
(1) 


We define computational cost as the product of compute resources (cores, nodes) 
times the wall time resulting in the metric of CPUh, the typical currency of 
supercomputing facilities. 

Regarding the first metric, the type of discretizations in space and time are often 
the first decision to be made. Numerical schemes that involve as few unknowns 
and as few time steps as possible to reach the desired accuracy will be more 
efficient. This goal can be reached by using higher order methods which have a 
higher resolution capability, especially for problems with a large range of scales 
and some regularity in the solution [19]. However several possibilities and profound 
knowledge regarding the performance capability of potential algorithms on modern 
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hardware are still required to select those algorithms and implementations that are 
optimal with respect to the guiding metric of accuracy versus time-to-solution. 
High-order (dis-)continuous finite element methods are the basic building block of 
the ExaDG project due to their generality and geometric flexibility. 

Regarding the second metric, solvers are deemed efficient if they keep the 
number of iterations minimal. We emphasize that “iterations” are defined in a 
low-level way as the number of operator evaluations, which is also accurate when 
nesting several iterative schemes within each other. Note that we assume that large- 
scale systems must be addressed by iterative solvers; in a finite element context 
sparse direct solvers are not scalable due to fill-in and complex dependencies 
during factorizations. One class of efficient solvers of particular interest to ExaDG 
are multigrid methods with suitable smoothers, which have developed to be the 
gold standard of solvers for elliptic and parabolic differential equations over the 
last decades. Here, the concept of iterations would accumulate several matrix- 
vector products within a multigrid cycle that in turn is applied in an outer Krylov 
subspace solver. Due to the grid transfer and the coarse grid solver, such methods 
are inherently challenging for highly parallel environments. As part of our efforts 
in ExaDG, we have developed an efficient yet flexible implementation in the deal.II 
finite element library [1, 15]. 

Third, the evaluation of discretized operators and smoothers remains the key 
component determining computational efficiency of a PDE solver. The relevant 
metric in this context is the throughput measured as the number of degrees of 
freedom (unknowns) processed per second (DoFs/s). An important contribution of 
our efforts is to both tune the implementation of a specific algorithm, but more 
importantly to also adapt algorithms towards a higher throughput. This means that 
an algorithm is preferred if it increases the DoFs/s metric, even if it leads to lower 
arithmetic performance in GFlop/s or lower memory throughput in GB/s. Operator 
evaluation in PDE solvers only involves communication with the nearest neighbors 
in terms of a domain decomposition of the mesh, which makes the node-level 
performance the primary concern in this regard. Since iterative solvers only require 
the action of the matrix on a vector (and a preconditioner), they are amenable to 
matrix-free evaluation where the final matrix entries are neither computed nor stored 
globally in memory in some generic sparse matrix format (e.g., compressed row 
storage). While matrix-free methods were historically often considered because they 
lower main memory requirements and allow to fit larger problems in memory [8], 
their popularity is currently increasing because they need to move less memory: 
Sparse matrix-vector products are limited by the memory bandwidth on all major 
computing platforms, so a matrix-free alternative promises to deliver a (much) 
higher performance. 

The outline of this article is as follows. We begin with an introduction of matrix- 
free algorithms and a presentation of node-level performance results in Sect. 2. In 
Sect.3, we describe optimizations of the conjugate gradient method for efficient 
memory access and communication. Next, we detail our multigrid developments, 
focusing on performance numbers and the massively parallel setup in Sect. 4 and 
on the development of better smoothers in Sect.5. Application results in the field 
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of computational fluid dynamics are presented in Sect. 6, where the efficiency and 
parallel scalability of our discontinuous Galerkin incompressbile turbulent flow 
solver are shown. An extension of the kernels to up to 6D PDEs is briefly presented 
in Sect. 7. We conclude with an outlook in Sect. 8. 


2 Node-Level Performance Through Matrix-Free 
Implementation 


An intuitive example of a matrix-free implementation is a finite difference method 
implemented by its stencil rather than an assembled sparse matrix [33]. For finite 
element discretizations with sufficient structure of the underlying mesh and low- 
order shape functions, a small number of stencils allows to represent the operator of 
a large-scale problem [8]. Such methods are used in the German exascale project 
TerraNeo, utilizing the regular data structures in hierarchical hybrid grids and 
embedded into a highly scalable multigrid solver for Stokes systems [31, 32]. By 
suitable interpolations, the stencils can be extended from the affine coarse grid 
assumption to also treat smoothly deformed geometries and variable coefficients [7]. 

For higher-order methods, finite element discretizations lead to fat stencils, 
making the direct evaluation inefficient even when done through stencils. An 
alternative matrix-free scheme used in ExaDG is to not compute the explicit DoF 
coupling and instead turn to integrals underlying the finite element scheme. As an 
example, we consider the constant-coefficient Laplacian 


—-Vu=f inQ, u=g nð, (2) 
whose weak form in a finite-dimensional setting is 


(VG, Vun)a, = (Gi, A Qn (3) 


where up(x) = oe 1:n Gj (%)uj is the finite element interpolant of the solution 
with n degrees of freedom, g; denotes the test functions with i = 1,...,n, 
f is some right hand side, and Q; is the finite element representation of the 
computational domain Q. The left-hand side of this equation represents a finite 
element operator, mapping a vector of coefficients u = [u;]; to an output vector 
v = [v;]; by evaluating the weak form for all test functions g; separately. A matrix- 
free implementation is obtained by evaluating the element-wise integrals 
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by quadrature on ng points per cell K. Here, K denotes the elements in the mesh, 


& the coordinates of the reference element K = (0, 1), Jk denotes the Jacobian 
of the mapping from the reference to the real cell, and wg the quadrature weight. 
The operator Ig denotes the index mapping from Maof,ele element-local to global 
unknowns and defines the element-related unknowns u% = Igu. 

On element K, the formulation of Eq. (4) consists of two nested sums over the 
elemental unknowns ue , J € Ndof,ele, and the quadrature points q. The result is 
tested against all test functions g;, on the reference element, which are related to 
the global test functions g; through Ig. Since the metric terms do not depend on 
the shape function indices ix and j, and the sum over j does not depend on ix, 
the summations in the equation can be broken up into (1) an dng X Nndof,ele Matrix 
operation to evaluate the reference element derivative of u£? at the quadrature 
points, (2) the application of metric terms as well as other physics terms at ng 
quadrature points, and (3) an ndof,ele X dng matrix operation to test by all 7dof,ele 
test functions and perform the summation over the quadrature points. The separation 
of point-wise physics evaluation at quadrature points is a common abstraction in 
integration-based matrix-free methods [29, 41, 49, 50]. 

For high-order finite element methods, the naive evaluation would involve 
all shape functions at all quadrature points, which is of complexity O(k*“) for 
polynomials of degree k in d dimensions per element, or O(k“) per unknown, 
similarly to the fat stencil of the final matrix. 

At this point, the structure in the reference-cell shape function and quadra- 
ture points can be utilized to lower the computational complexity. If the multi- 
dimensional shape functions are the tensor product of 1D shape functions, and 
if the quadrature formula is a tensor product of 1D formulas, the so-called sum- 
factorization algorithm can be used to group common factors along the various 
dimensions and break down the work into one-dimensional interpolations. Figure | 
visualizes the process of computing the interpolation of nodal values, visualized 
by black disks, to the values at the quadrature points. Rather than using a naive 
interpolation of cost 2(k+ 1 operations, it can be done in 2d(k+ eis operations 
instead. In matrix-vector notation, the interpolation of the gradient with respect to 
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Fig. 1 Illustration of sum factorization for interpolation from node values on the left to the values 
in quadrature points (right) 
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x, evaluated at quadrature points, can be written as 


dun/Az, T®...@1@D 
dun/AX> I8... 8D QI (K) 
f = : [Sa 8...8 2 8 Si] u. (5) 
dunka Dg ®I®...@1 
Here, S;,..., Sg denote the ie x (k + 1) interpolation matrices from the nodal 
values to the quadrature points, obtained by evaluating the 1D basis at all 1D 
quadrature points, and D),..., Dg then 1D x 1D matrices of the derivatives of the 


Lagrangian basis in quadrature points. In this form, the multiplication by Kronecker 
matrices is implemented by small matrix-matrix multiplications. 

Sum factorization was initially developed in the context of spectral element 
methods by Orszag [61], see [19] for an overview of the developments. In [12], 
sum factorization was compared against assembled matrices with the goal to find 
the best evaluation strategy among assembled matrices and matrix-free schemes. 
For hexahedral elements considered in this work, the memory consumption and 
arithmetic complexity indicate that this is the case already for quadratic basis 
functions [11, 49], with a growing gap for higher polynomial degrees. 


2.1 Implementation of Sum Factorization in the deal.II Library 


As part of the ExaDG project, we have developed efficient implementations in the 
deal.II finite element library [1, 4] with the following main features, see [50] for a 
detailed performance analysis: 


support for both continuous [49] and discontinuous finite elements on uniform 
and adaptively refined meshes with hanging nodes and deformed elements, 
support for arbitrary polynomial expansions on quadrilateral and hexahedral 
element shapes as well as tensor product quadrature rules, 

minimization of arithmetic operations by using available symmetries, such as the 
even-odd decomposition [69] and a switch between the collocation derivative (5) 
for n!D ~ k + 1 quadrature points or an alternative variant based on derivatives 
of the original polynomials as used in [49] and discussed in [29], 

flexible implementation of operations at quadrature points, 

vectorization across several elements to optimally use SIMD units (AVX, AVX- 
512, AltiVec) of modern processors, 

applicability to modern multi-core CPUs as well as GPUs [51, 57], 

data access optimizations such as element-based loops for DG elements [50, 56], 
and MPI implementation with tight data exchange as well as MPI-only and 
shared-memory models [43, 48, 54]. 
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The concept of matrix-free evaluation with sum factorization has been widely 
adopted by now, like in the deal. II [1], DUNE [5, 40, 60], Firedrake [63], mfem [2], 
Nek5000 [28] or Nektar++ [13] projects. These fast evaluation techniques are 
directly applicable to explicit time stepping schemes, as we have demonstrated 
for wave propagation in [42, 53, 65-68] and the compressible Navier—Stokes 
equations [24]. The proposed developments make matrix-free evaluation of high- 
order DG operators reach a throughput in unknowns per second almost as high as for 
optimized 5-wide finite difference stencils in a CFD context [75], despite delivering 
much higher accuracy. 


2.2 Efficiency of Matrix-Free Implementation 


In Fig.2, we give an overview of the achieved performance with our framework 
applied to the discontinuous Galerkin interior penalty (IPDG) discretization of the 
3D Laplacian on an affine geometry. The most advanced implementation presented 
in [50] is used, namely a cell-based loop with a Hermite-like basis for minimal 
data access [56]. The figure lists the throughput, which is measured by recording 
the run time of the matrix-vector product in an experiment with around 50 million 
DoFs (too large to fit into caches), and reporting the normalized quantity DoFs/s 
obtained by dividing the number of DoFs by the measured run time. The code is 
run on a single node of six dual-socket HPC systems from the last decade with a 
shared-memory parallelization with OpenMP, threads pinned to logical cores with 
the close affinity rule, and using streaming stores to avoid the read-for-ownership 
data transfer [33] on the result vector. As systems, we consider a 2 x 8 core AMD 
Opteron 6128 system from 2010, a 2 x 8 core Intel Xeon 2680 Sandy Bridge from 
2012 (as used in the SuperMUC phase 1 installation in Garching, Germany), a 2 x 8 
core Intel Xeon 2630 v3 (Haswell) representing a medium-core count chip from 
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Fig. 2 Throughput of matrix-free evaluation of the IPDG discretization of the 3D Laplacian on an 
affine grid 
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2014, a2 x 14 core Intel Xeon 2697 v3 (Haswell) representing a high-core count 
chip of the same generation (as used in the SuperMUC phase 2 installation), a 2 x 
20 core Intel Xeon 2698 v4 (Broadwell) system from 2016, and a 2 x 24 core 
Intel Xeon Platinum 8174 from 2017, labeled ‘Skylake’ in the remainder of this 
work, and installed in the SuperMUC-NG supercomputer. The chips are operated 
at 2.0 GHz, 2.7 GHz, 2.4 GHz, 2.6 GHz, 2.2 GHz, and 2.3 GHz, respectively, and all 
run with fully populated memory interfaces. The Intel machines are run with 2-way 
hyperthreading, e.g. with 96 threads for the Xeon Platinum Skylake. 

The throughput results in Fig.2 demonstrate the advancements of hardware 
during the last decade. In particular the increased width of vectorization, from 2 to 4 
doubles with Sandy Bridge and from 4 to 8 doubles with Skylake, are clearly visible. 
Furthermore, the comparison between Sandy Bridge and the smaller Haswell system 
reveals the benefit of fused multiply-add (FMA) instructions and higher L1 cache 
bandwidth of the latter: For low polynomial degrees with a modest number of 
FMA instructions, Sandy Bridge with its higher frequency can approximately 
deliver the same performance as Haswell. As the polynomial degree is increased, 
the arithmetic work is increasingly dominated by FMAs in the sum factorization 
sweeps similar to (5) as shown in [50], and Haswell pulls ahead. Finally, while 
we observe a throughput of up to 5.7 billion DoFs/s on Skylake (with up to 1.35 
TFlop/s for k = 8), we observe a relatively strong decrease of performance for 
polynomial degrees k > 13: This is because the vectorization across elements 
leads to an excessive size of the temporary data within sum factorization—here, 
a different vectorization strategy could lead to better results. However, we consider 
the polynomial degrees 3 < k < 8 most interesting for practical simulations, where 
almost constant throughput in terms of DoFs/s is reached. This somewhat surprising 
result, given the expected O(1/k) complexity of throughput for sum factorization, is 
because face integrals and memory access with an O(1) complexity are dominant. 
Compared to our initial implementation in 2015, which achieved a throughput of 
0.32 billion DoFs/s on Sandy Bridge with degree k = 3, the progress in software 
technologies allowed us to reach 1.02 billion DoFs/s on the same system. For Intel 
Skylake, where memory access is more important, the software progress of our 
project is more than 4x. 

Figure 3 shows the throughput normalized by the number of cores for polynomial 
degree k = 4 over the different hardware generations. For operator evaluation 
with discontinuous elements and face integrals, approximately 200 floating point 
operations per unknown are involved with our optimized implementations [50]. At 
the same time, we must access at least 16 byte (read one double, write one double) 
plus some neighbors that are not cached, so the arithmetic intensity is around 8—12 
Flop/Byte, close to the machine balance of the Skylake Xeon. This means that both 
memory bandwidth and arithmetic performance are relevant for performance (on 
one Skylake node, we measured memory throughput of around 160 GB/s, compared 
to the STREAM limit of 205 GB/s). Likewise, continuous elements evaluated on 
an affine mesh have seen a considerable increase in throughput per core (arithmetic 
intensity of 7 Flop/Byte). However, the improvement has been much more modest 
for continuous elements evaluated on curved elements. In this setting, separate 
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Fig. 3 Evolution of throughput of matrix-vector product per core (computed on a fully populated 
node and divided by the number of cores) with matrix-free evaluation with k = 4 versus a sparse 
matrix-vector product with continuous linear elements on various hardware 


metric terms for all quadrature points and all elements are needed (as opposed to 
a single term per element in the affine mesh case), reducing the arithmetic intensity 
to around 1.2 Flop/Byte.! 

Figure 3 also contains the evolution of performance of a sparse matrix-vector 
product for tri-linear continuous finite elements. The performance is much lower 
due to the aforementioned memory bandwidth limit, and has hardly improved per 
core on Skylake over the dated Opteron architecture. This illustrates the effect of 
the so-called memory wall. We emphasize that the sparse matrix-vector product for 
k = 1 is more than three times slower than even the matrix-free evaluation for k = 4 
on curved elements. Hence, high-order methods with matrix-free implementations 
are faster per unknown on newer hardware, in addition to their higher accuracy. 


3 Performance-Optimized Conjugate Gradient Methods 


The developments of matrix-free implementations presented in the previous section 
result in a throughput for evaluation of the IPDG operator in Fig.2 of up to 5.7 
billion DoFs/s on Skylake. This is equivalent in time to the mere access of 4.5 
doubles per DoF (either reading or writing). In other words, our developments 
have made the operator evaluation so fast that the matrix-vector product may no 
longer be the dominant operation in algorithms like the conjugate gradient (CG) 
method preconditioned by the diagonal, or Chebyshev smoothers. These algorithms 


'The merged final coefficient tensor TTR det(T. K)Wg is used for the present results, i.e., 6 
doubles per quadrature point [29, 51]. 
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involve access to between 6 and 18 vectors for vector updates, the application of the 
diagonal preconditioner, and inner products. For optimal application performance 
it is therefore necessary to look into the access to vectors. As proposed in our 
work [51, 56, 65], merging the vector operations can improve throughput by up 
to a factor of two, and in particular for the DG case with cell-based loops which 
allow for a single pass through data [48, 56]. Fusion of different steps of a scheme 
has also been proposed for explicit time integrators in [14]. 

For the assessment of optimization opportunities on the algorithm level that 
goes beyond the matrix-vector product, we consider a high-order finite element 
benchmark problem suggested by the US exascale initiative “Center for Efficient 
Exascale Discretization” (CEED). The benchmark involves a continuous finite 
element discretization of the Laplacian (3), using matrix-free operator evaluation 
within a conjugate gradient solver preconditioned by the matrix diagonal. In this 
study, we consider the case BP5 [29], see also https://ceed.exascaleproject.org/bps/, 
which integrates the weak form (4) of polynomial degree k using a Gauss—Lobatto 
quadrature formula with n r = k + 1 quadrature points on a cube with deformed 
elements. While this integration is not exact, it is the typical spectral element setup 
with an identity interpolation matrix $; = J in Eq. (5). 

Figure 4 lists the contributors to run time for the plain conjugate gradient method 
preconditioned by the point-Jacobi method as a function of the problem size for 
the polynomial degree k = 6. Here, the metric terms Jg are computed on the 
fly from a tri-linear representation of the geometry. Three different performance 
regimes can be distinguished in the graph: To the left, there is not enough parallelism 
given the domain decomposition on 48 MPI ranks and batches of 8 elements due to 
vectorization—indeed, at least 85,000 DoFs are needed to saturate all cores and 
SIMD lanes. Furthermore, the synchronization barriers due to the inner products in 
the conjugate gradient method also lead to a slowdown. As the problem size and 
parallelism increase, the run times decrease significantly and reach a minimum for 
a problem size around one million DoFs. Here, all data involved in the algorithm 
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Fig. 4 Breakdown of times per CG iteration in CEED benchmark problem BPS [29] for the plain 
conjugate gradient method with k = 6 on one node of dual-socket Intel Skylake 
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Fig. 5 Study of merged vector operations for conjugate gradient solver for the CEED benchmark 
problem BP5 [29] on one node of dual-socket Intel Skylake for k = 6 


fits into the approximately 110 MB of L2+L3 cache on the processors. As the size 
is further increased, caches are exhausted and most data must be fetched from slow 
main memory. As a consequence, the run time of the solver increases significantly, 
and the vector updates, the diagonal preconditioner, and the inner products take a 
significant share. Note that all vector operations use the hardware optimally with a 
memory throughput of 205 GB/s. 

In order to improve performance, we have therefore developed conjugate gradient 
implementations with merged vector operations by loop fusion. Figure 5 compares 
three variants of the conjugate gradient solver: the plain conjugate gradient method 
runs all vector operations through high-level vector interfaces with separate loops 
for addition and inner products. In the “merged dot products”, we have merged 
the dot product p'A p following the matrix-vector product into the loop over the 
elements, and merged the vector updates to the residual and solution with the dot 
product for ri Pir. Here, r denotes the residual vector, p the search direction of 
the conjugate gradient method, A the matrix operator (represented in a matrix-free 
way), and P~! the diagonal preconditioner. However, the improvements with this 
algorithm are relatively modest. 

Much more performance can be gained by creating a conjugate gradient variant 
we call “fully merged”: Here, each CG iteration performs a single loop through 
all vector entries and ideally reads 5 vectors (solution, residual, search direction, 
temporary vector to hold the matrix-vector product, and diagonal of preconditioner) 
and writes four (solution, residual, search direction, temporary vector). All vector 
updates of the previous CG iteration are scheduled before the matrix-vector product 
and all inner products are scheduled after the matrix-vector product. The vector 
operations are interleaved with the loop over elements, ensuring that dependencies 
due to the access pattern of the loop and the MPI communication are fulfilled (this 
leads to slightly more access in practice). This approach applies the preconditioner 
several times with partial sums to construct the inner products with a single 
MPI Allreduce, trading some local computations for the decreased memory 
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access. Of course, fusing the preconditioner into the loop assumes that it is both 
cheap to apply and does not involve long-range coupling between the DoFs. The 
results in Fig.5 show that performance in the saturated limit, i.e., for large sizes 
beyond 107 DoFs, is 2.5 times faster than with the plain CG iteration. Interestingly, 
this also improves performance for the sizes fitting into caches, which is due to less 
synchronization and reducing access to the slower L3 cache. 

To put the performance of the fully merged case on Intel Skylake into perspective, 
we compare with executing the plain CG method on an Nvidia V100 GPU using the 
implementation from [51, 57]: even though the GPU runs with around 700 GB/s of 
memory throughput, the performance is higher on Intel Skylake with only 200 GB/s 
from RAM memory because the merged loops significantly increase data locality. 
Furthermore, on the GPU we do not compute the metric terms on the fly, but 
load a precomputed tensor Tg Tg IT, K)W, which is faster due to reduced 
register pressure, see also the analysis for BPS in [71]. We also note that the 
GPU results with our implementation are faster than an implementation with the 
OCCA library described in [29] with up to 0.6 billion DoFs/s on a V100 of the 
Summit supercomputer. The reason is that our implementation uses a continuous 
finite element storage that does not duplicate the unknowns at shared vertices, edges 
and faces, which reduces the memory access by about a factor of two. Furthermore, 
the results from [29] involve a separate gather/scatter step with additional memory 
transfer to enforce continuity, while this is part of the operator evaluation within a 
single loop in our code. 

Figure 6 lists the achieved throughput with a fully merged conjugate gradient 
solver for polynomial degrees k = 2,..., 8, the most interesting regime for our 
solvers. We use a tri-linear representation of geometry and compute the geometric 
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Fig. 6 Throughput of the CEED benchmark problem BP5 [29] on one node of dual-socket Intel 
Skylake for k = 2,..., 8 with fully merged conjugate gradient solver 
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Fig. 7 Study of geometry representation for the CEED benchmark problem BP5 [29] on one node 
of dual-socket Intel Skylake with k = 6 for matrix-vector product only (left) and with fully merged 
vector operations in the conjugate gradient solver (right) 


factors on the fly. Throughput is somewhat lower for quadratic and cubic elements 
because the geometry data located in the vertices is still noticable. 

The results in Fig. 3 motivate the analysis of the representation of the geometry 
in the matrix-vector product, with results presented in Fig. 7. The figure lists both 
the throughput of the matrix-vector product in the left panel and the throughput 
of the complete CG iteration with merged vector operations. Highest performance 
is obtained for the affine mesh case where our implementation can compress the 
memory access of the Jacobian. While this case is excluded from the CEED BP5 
specification that requires a deformed geometry [29], it is an interesting baseline 
to compare against. Using separate tensors for each quadrature point, “variable 
tensor cached”, is equally fast as the affine case as long as data fits into caches. 
However, performance drops once the big geometric arrays must be fetched from 
main memory. For the case the geometry is computed on the fly from a tri-linear 
representation of the mesh, i.e., the vertices, the matrix-vector product is slower 
than the affine variant. For the conjugate gradient solver, however, we observe that 
the two reach essentially the same performance for five million and more DoFs, 
as they are both limited by the memory bandwidth from vector access. The “tri- 
linear compute” case involves a higher Flop/s rate with almost 700 GFlop/s, as 
compared to the throughput of 330 GFlop/s for the affine mesh case. This means 
that the merged vector operations allow us to fit additional computations behind the 
unavoidable memory transfer without affecting application performance. Finally, 
an isoparametric representation of the geometry (labeled “isopara compute” in 
Fig. 7) can also be computed on the fly by sum factorization from a kth degree 
polynomial [50]. While this case is obviously slower than the precomputed variable- 
tensor case from caches, it leverages higher performance when data must be fetched 
from main memory. 
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Fig. 8 Throughput of the CEED benchmark problem BPS [29] with k = 6 on up to the full 
SuperMUC-NG machine. Throughput normalized per node 


The tri-linear and isoparametric cases are not equivalent, as only the latter 
represents higher order curved boundaries. Intermediate polynomial degrees for the 
geometry are conceivable, which would land between the two in terms of application 
throughput. To combine the higher performance of the former, we plan to investigate 
the tradeoffs in more detail in the future, e.g. by using a k-degree representation on 
a single layer of elements at the boundary and a tri-quadratic representation in the 
domain’s interior. 

Finally, Fig. 8 shows the weak scaling of the BP5 benchmark problem up to the 
full size of the SuperMUC-NG machine with 6336 nodes and 304,128 cores. The 
data is normalized by reporting the number of DoF per node, so ideal weak scaling 
would correspond to coinciding lines. While the saturated performance is scaling 
well, giving a sustained performance of up to 4.4 PFlop/s,* most of the in-cache 
performance advantage is lost due to the communication latency over MPI, see 
also [62] for limits with MPI in PDE solvers. Defining the strong scaling limit as the 
point where throughput reduces to 80% of saturated performance [29], it is reached 
for wall times of 56 us on 1 node. On 512 nodes, the strong scaling limit is already 
around 180 us, whereas it is 245 us on the full SuperMUC-NG machine. Note that 
even though most optimizations presented in this section have addressed the node- 
level performance, we have also considered the strong scaling in our work—indeed, 
the strong scaling on SuperMUC-NG is excellent with a limit around 5 times lower 
than the BlueGene-Q results presented in [29]. 


?The LINPACK performance of SuperMUC-NG according to the top500 list is 19.4 PFlop/s. 
Considering that we use an iterative solver for PDE with optimization of throughput, this is an 
extremely good value. 
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4 Geometric Multigrid Methods in Distributed 
Environments 


Multigrid methods are efficient solvers for the linear systems arising from the 
discretization of elliptic problems, see [30] for a recent efficiency evaluation 
and [35] for a projection of elliptic solver performance to the exascale setting. They 
apply simple iterative schemes called smoothers on a hierarchy of coarser problem 
representations. On each level of the hierarchy, the smoothers address the high- 
frequency content of the solution by smoothening the error. On a sufficiently coarse 
level with a small number of unknowns, a direct solver can be applied. The multigrid 
algorithm can be realized by a V-cycle as illustrated in Fig. 9 or some related cycle 
(W-cycle or F-cycle). In the matrix-free high-order finite element context, variants 
of the Chebyshev iteration around a simple additive scheme, such as point-Jacobi or 
approximate block-Jacobi with some rank-d approximation of the cell matrix, are 
state of the art. The results in this section are based on this selection. Overlapping 
Schwarz schemes are a new development detailed in Sect. 5 below. 

In terms of finding the coarser representations for the multigrid hierarchy, high- 
order finite element and discontinuous Galerkin methods permit a range of options. 
The hierarchy can both be constructed by coarser meshes (i-multigrid), by lowering 
the polynomial degree (p-multigrid), by a discontinuous-continuous transfer as well 
as algebraically based on the matrix entries only (algebraic multigrid). The latter do 
not fit into a matrix-free context, since they explicitly rely on a sparse matrix and 
also often are not robust enough as the degree increases. As it has been shown by 
the work [70], scalability to the largest supercomputers is much more favorable if 
knowledge about coarsening by a mesh can be provided. In other words, geometric 
multigrid is to be preferred over algebraic multigrid in case there is such structure 
in the problem. 


smoothen smoothen 
level £ = 2 
restrict prolongate 
smoothen © smoothen level € = 1 
restrict prolongate 
level £ = 0 


coarse solve 


Fig. 9 Illustration of multigrid V-cycle with smoothing on each level and restriction/prolongation 
between the levels (left) and exemplary partitioning of a grid with adaptive refinement partitioned 
among 3 processors. The partitioning of the active cells is shown in the mid panel and on the 
various multigrid levels on the right panel 
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For these reasons, we have developed a comprehensive geometric multigrid 
framework with deal.II. In [27], a hybrid multigrid solver with all possibilities of h-, 
p-, and algebraic coarsening has been combined in a flexible framework, with the 
possibility to perform an additional c-transfer from discontinuous to continuous 
function spaces for the DG case. In terms of the h-MG method on adaptive meshes, 
the deal II library implements the local smoothing algorithm [9, 36, 37] where 
smoothing is done level by level. Our work [15] developed a communication- 
efficient coarsening strategy for this setup, at the cost of a load imbalance for 
smoothing on the multigrid levels with adaptively refined meshes. The tradeoffs 
in this choice and the associated costs have been quantified by a performance model 
in [15]. 

Figure 10 shows the results of two strong scaling experiments of the multigrid 
V-cycle with the h-multigrid infrastructure of the deal.II library. The uniform grid 
and a typical adaptively refined case are compared for the same problem size of 137 
million and 46 billion DoFs, respectively, see [15] for details on the experiment. 
Differences in run time are primarily due to the load imbalance for the level 
operations. The results demonstrate optimal parallel scaling of both the uniform 
and adaptively refined cases down to around 107? s, with a slightly better strong 
scaling of the adaptive case due to the slower baseline. This performance barrier— 
typical for strong scaling of multigrid schemes in general—can be explained by 
the specific type of global communication in this algorithm: from the fine mesh 
level with many unknowns distributed among a large number of cores, we transfer 
residuals to coarser meshes with restriction operators until the coarse grid solver is 
either run on a single core or with few cores in a tightly coupled manner. Then, the 
coarse-grid corrections are broadcast during prolongation, involving all processors 
again. The communication pattern of a multigrid V-cycle thus relates to a tree-based 


approx 16.9M cells/137.4M DoFs approx 5.8B cells/46.4B DoFs 
10! 
—e— uniform —e— uniform 
10° —— adaptive —— adaptive 
=-=- model =-=- model 
A 7 2, 
-E E 
2 2 
S] o 0 
2 107 ee 
> a A > 
107° "a 
64 256 1024 4k 16k 65k 16k 32k 65k 
Number of cores Number of cores 


Fig. 10 Strong scaling of geometric multigrid V-cycle for 3D Laplacian on uniform and adaptively 
refined mesh using continuous Qz elements with matrix-free evaluation on up to 4096 nodes (64k 
cores) of 2 x 8 core Intel Sandy Bridge (SuperMUC phase 1). Adapted from [15] 
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implementation of MPI_Al1lreduce, with the difference that the communication 
tree is induced by the grid and substantial operations, namely smoothing and level 
transfer, are intermixed with the communication. In this particular case, nine matrix- 
vector products with nearest-neighbor communication are performed per level (eight 
in the smoother and one for the residual before restriction). In addition, two vertical 
nearest-neighbor exchange operations are done in restriction and prolongation. A 
typical matrix-vector product with up to 26 neighbors takes around 1074s on the 
chosen Intel Sandy Bridge system when run on a few thousands of nodes [52]. When 
done on seven levels plus the coarse mesh for the uniformly refined 137 million 
DoFs case, the expected saturated limit of around 8 ms is exactly seen in the figure. 
On the newer SuperMUC-NG machine, a latency barrier per V-cycle of around 2- 
4ms per V-cycle has been measured, depending on the number of matrix-vector 
products for the level smoothers. This limit is attractive compared to alternative 
solvers for elliptic problems such as the fast multipole method or the fast Fourier 
transform [30, 35]. 

Multigrid schemes are at the heart of incompressible flow solvers through the 
pressure Poisson equation, as detailed in Sect.6 below. Applications of matrix- 
free geometric multigrid to continuum mechanics were presented in [18] and to 
electronic calculations with sparse multivectors in [16, 17]. 

As an example of the large-scale suitability of the developed multigrid frame- 
work, Fig. 11 shows two scaling experiments on the SuperMUC-NG supercomputer 
with up to 304,128 cores of Intel Skylake. Black dashed lines denote ideal strong 
scaling along a line and weak scaling with a factor of 8 between the lines. The com- 
putational domain is a cube meshed by hexahedral elements, using the affine mesh 
code path for matrix-free algorithms discussed in Sect.2. A consistent Gaussian 
quadrature with n a = k + 1 points is chosen. We run a conjugate gradient solver 


to a relative tolerance of 107° compared to the initial unpreconditioned residual. 
This setup is motivated by applications where a very good initial guess is already 


Continuous Galerkin, degree k = 4 Discontinuous Galerkin, degree k = 5 


Solver time [s] 


48 192 768 3.1k 12k 49k 152k 96 384 1536 6.1k 25k 98k 304k 
Number of cores Number of cores 


Fig. 11 Multigrid strong scaling analysis for tolerance 107° with 2 CG iterations 
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available, e.g. by extrapolation of solutions from the old time step [22, 45], and only 
a correction is needed. More accurate solves are obtained by tighter tolerances or 
by full multigrid setups [51]. The multigrid V-cycle is run in single precision to 
increase throughput, together with a double precision correction through the outer 
CG solver. This setup has been shown in [51] to increase throughput by around 1.8 x 
without affecting the multigrid convergence. 

In the left panel of Fig.11, we present results for a continuous Galerkin 
discretization with a polynomial degree k = 4. A pure geometric coarsening 
down to a single mesh element is used. A Chebyshev iteration of degree five 
based on the matrix diagonal, i.e., point Jacobi, is used on all levels for pre- and 
post-smoothing. The maximal eigenvalue Amax is estimated by 15 iterations of a 
conjugate gradient solver and the Chebyshev parameters are set to smoothen in a 
range [0.06Amax , ees aoe ]. As a coarse solver, we use a Chebyshev iteration with 
the degree set to reduce the residual by 10° in terms of the Chebyshev a-priori error 
estimate [74]. We observe ideal weak scaling and strong scaling to around 107? s. 
More importantly, the absolute run time is excellent: For instance, the 8.6 billion 
DoF case on 1536 cores is solved in 1.45, i.e., 4.0 million DoFs are solved per core 
per second. 

The right panel of Fig. 11 shows the result for multigrid applied to an IPDG 
discretization with k = 5. Here, we use a transfer from the discontinuous space to 
the associated continuous finite element space with k = 5 on the finest mesh level 
(see [3] for the theoretical background and [27] for the multigrid context) and then 
progress by h-coarsening to a single element. On the DG level, we use a Chebyshev 
smoother around a block-Jacobi method, with the block-Jacobi problems inverted 
by the fast diagonalization method [58]. On all continuous finite element levels, 
a Chebyshev iteration around the point-Jacobi method is used. The degree of the 
Chebyshev polynomial is six. This solver setup achieves a multigrid convergence 
rate of about 0.025, i.e., reduces the residual by 3 orders of magnitude with two 
V-cycles. If used in a full multigrid setting [51], a single V-cycle on the finest 
level would suffice to solve the problem to discretization accuracy. Merged vector 
operations with a Hermite-like basis for the Chebyshev iteration are used according 
to [56]. The final application performance of the largest computation on 1.9 trillion 
DoFs is 5.9 PFlop/s, with 5.6 PFlop/s done in single precision and 0.27 PFlop/s 
in double precision. The limiting factor is mostly memory transfer, however, with 
an application throughput of around 175 GB/s per node (the STREAM limit of one 
node is 205 GB/s). 


5 Fast Tensor Product Schwarz Smoothers 


In Sect.4, we have discussed a scalable implementation of geometric multigrid 
methods, obtaining an efficient solver in the sense of cost per iteration. It employs 
the matrix-free operator implementation from Sect. 2 in order to reduce the compu- 
tational cost for residuals and grid transfer. The missing building block for our cost 
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model in Eq. (1) is an efficient implementation (in terms of computational cost per 
DoF) of an efficient smoother (in terms of number of multigrid iterations). 

The main challenge consists of finding preconditioners whose cost is similar to 
operator evaluation. So far, we have discussed Chebyshev smoothers, which can be 
implemented matrix-free in a straight-forward fashion. Alas, their performance is 
not robust for higher order elements. Likewise, from an arithmetic cost point of view 
sparse matrices can be competitive at most for moderate polynomial degrees k = 
2, 3 [55] or when done via auxiliary spaces of linear elements on a subdivided grid 
using some matrix-based preconditioner. However, Fig.3 shows that even sparse 
matrices for linear elements are up to 10 times slower than the matrix-free operator 
evaluation. It seems that only the two SPPEXA projects ExaDUNE and ExaDG 
have addressed this question in [6, 76]. While [6] focuses on iterative solution of 
cell problems for multigrid smoothing, we consider domain decomposition based 
smoothers in the form of multilevel additive and multiplicative Schwarz methods 
based on low-rank tensor approximations. They consist of a subdivision of the mesh 
on each level into small subdomains consisting either of a single cell, or of the patch 
of cells sharing a common vertex. On each of these subdomains, local finite element 
problems are solved. Comparing with operator application, these smoothers share 
the structural property of evaluation of local operators on mesh cells or on a patch 
of cells. They differ by the fact that the smoothers involve local inverses instead of 
local forward operators, and that these local inverses in general are not amenable 
to a tensor decomposition like sum factorization. There is one exception though, 
namely separable differential operators. In d dimensions these can be written in the 
form 


L=1g@---@h@l+---+Lg@l-1®::-Oh, (6) 


where Lg are one-dimensional differential operators and J; are identity operators 
in directions k = 1,..., d. This representation transfers to finite element operators 
with tensor product shape functions in a straight-forward way, reading J; as one- 
dimensional mass matrices Mx. 

Due to [58], the inverse of L can be represented as the product 


LAO), (7) 


with the diagonal matrix A = Ig ®---@In @ A, +---+Aqg@ Ig_-1 ®::-@h, 
where J; denote identity matrices, and a rank-1 decomposition Q = Q7®---®Q}. 
The tensor factors are obtained by solving d generalized eigenvalue problems 


Ar = QF Lk Qk, 


i (8) 
Ik = Qy Mx Qk, | aad ear 


Thus, the computational effort for computing the inverse has been reduced from 
O(k*) to O(dk?) and for the application of local solvers from O(k7“) to O(dk¢*!) 
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by exploiting sum factorization. Based on this technique, we have implemented a 
geometric multigrid method in [76] based on earlier work in [37-39]. 


5.1 The Laplacian on Cartesian Meshes 


In order to test our concept and to obtain a performance baseline for more 
complicated cases, we first attend to the case where the decomposition described 
above can be applied in a straightforward way, namely the additive Schwarz method 
with subdomains equal to mesh cells. As Table 1 shows, it yields an efficient 
preconditioner with less than 25 conjugate gradient steps for a gain of accuracy 
of 108. While it is uniform in the mesh level, it is not uniform in the polynomial 
degree due to the increasing penalty parameter of the interior penalty method. The 
computational effort for a smoothing step based on local solvers in the form (7) 
is below the effort for a matrix-free operator application for polynomial degrees 
between 3 and 15 in three dimensions because it only involves operations on cells. 
The setup time for computing Q and A is even less. Thus, in the context of the 
performance analysis of the conjugate gradient method in Sect. 2, it barely adds to 
the cost per iteration step, but reduces the number of matrix-vector products when 
comparing to the accumulated numbers within a Chebyshev/point Jacobi method, 
and almost independently of polynomial degree. 

In view of application to incompressible flow, we also study vertex patches as 
typical subdomains for smoothing. First, we observe that a regular vertex patch with 
24 cells attached to a vertex inherits the low-rank tensor product structure from 
its cells, possibly after renumbering due to changes in orientation. Thus, we can 
apply the same method as on a single cell, resulting effectively in a factor 2% in the 
complexity estimates above. Patches around vertices with irregular topology like 3 
or 5 cells in two dimensions do not possess a tensor product structure. Fortunately, 


Table 1 Fractional CG 
iterations, preconditioned by 
h-MG with additive Schwarz 


Levels | Convergence steps 
2D k=3 |k=4 |k=7 |k=10 


smoother on cells 7 14.5 |143 |18.8 | 20.9 
8 14.5 14.3 18.8 20.9 
9 [145 |143 [18.8 |209 
10 14.5 14.3 18.8 20.9 


3D [k=3 [k=4 |k=7 |k=10 
3 16.7 [168 |22.0 | 24.5 

4 17.1 [17.0 [220 [24.5 
5 
6 


17.2 |17.0 |22.1 | 246 
17.1 17.0 | 22.1 24.7 


Relative solver tolerance of 1078 and relax- 
ation parameter © = 0.7 
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on meshes obtained by refinement of a coarse mesh, they are all determined by 
irregularities of the coarse mesh and thus small in number. 

Vertex patches lead to overlapping decompositions with overlap of at least 4 
and 8 in two and three dimensions, respectively. From the analysis of Schwarz 
methods, it becomes clear that a multiplicative method is required for highest 
multigrid convergence rates. In order to parallelize such a smoother and to avoid 
race conditions, mesh cells are colorized, that is, they are separated into “colors” 
such that patches of the same “color” do not share any common face or cell. 
As a consequence, the multiplicative method coincides with an additive method 
within each color, such that we can execute the local solvers in parallel within each 
color, and the colors sequentially. Typical convergence results for the Laplacian are 
reported in Table 2, suggesting that this scheme is almost a direct solver. 

The vertex patch has 4 and 8 times as many unknowns as a single mesh cell 
in two and three dimensions, respectively. Thus, the effort for a smoothing step 
with 16 colors and the optimizations described above turns out to be about 20 to 
24 times the effort of a matrix-free operator application, measured over polynomial 
degrees from 3 to 15. This seems excessive at first glance, but it must be kept in mind 
that the Chebyshev smoother of degree 6 used in Fig. 11 also involves 12 matrix- 
vector products for pre- and post-smooting. Futhermore, the current scheme comes 
with a reduction of the number of steps by a factor 10 compared to the additive 
cell smoother for the Laplacian, which makes it almost competitive [76]. Finally, 
the iteration counts are independent of the polynomial degree, making the scheme 
attractive for higher degrees. Moreover, we point out that this smoother also allows 
for the solution of a Stokes problem in four iteration steps [39]. 


Table 2 Fractional GMRES iterations, preconditioned by h-MG with multiplicative Schwarz 
smoothers on vertex patches 


Levels | Convergence steps Colors || Levels | Convergence steps Colors 
2D k=3|k=4|k=7|k=10 2D k=3|k=4|k=7|k=10 

7 12.5 125 |21 [21 8 2.9 |29 |26 |2.5 17 
8 2.5 |25 |21 |20 8 8 2.9 |29 |26 |2.5 17 
9 2.5 |24 |2.1 |2.0 8 2.9 |29 |26 |2.5 17 
10 [2.5 |24 |20 |2.0 8 10 29 |29 |26 |24 17 
3D |k=3|k=4|k=7 k=10 3D k=3|k=4|k=7|k=10 
3 24 |25 |21 1.8 16 3 2.6 |27 |24 |24 35 
4 24 |25 |2.1 1.9 16 4 2.8 |2.8 |2.5 | 2.4 49 
5 [2.4 |25 |21 1.9 16 5 2.8 (28 (25 |24 |51 
6 24 |25 |21 | 19 16 6 2.8 (2.8 |2.5 | 2.4 52 


Based on minimal coloring (left) and graph coloring (right) with a relative solver tolerance of 1078 
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5.2 General Geometry 


As soon as the mesh cells are not Cartesian anymore, the special structure of 
separable operators in (6) is lost and the inverse cannot be computed according 
to (7). In this case, we have two options: solving the local problems iteratively, 
as in [6], or approximately. A possible approximation which recovers the situation 
of the previous subsection consists of replacing a non-Cartesian mesh cell by an 
approximating (hyper-)rectangle, then inverting the separable differential operator 
on the rectangle (omitting the prefix hyper from here on). 

Such a surrogate rectangle can be obtained from the following procedure: first, 
we compute the arc length of all edges. From these, we obtain the length of the 
rectangle in each of its natural directions by averaging over all parallel edges (in 
a topological sense). Thus, the geometry of the rectangle is determined up to its 
position and orientation in space. Given the fact that the Laplacian is invariant under 
translation and rotation, these do not matter and we can choose a rectangle centered 
at the origin with edges parallel to the coordinate directions. Different differential 
operators may require different approximations here. 

The convergence theory of Schwarz methods allows for inexact local solvers as 
long as they are spectrally equivalent. Naturally, the deviation from exactness enters 
into the convergence speed of the method. Additionally, inexact local solvers can 
amplify the solution, such that a smaller relaxation parameter may be necessary. 
This is exhibited in Table 3, where we compare the efficiency of multigrid with 
exact local solvers and the method with surrogate rectangles as described above. 
We see that a reduction of the relaxation parameter © = 0.7 for exact local solvers 
to © = 0.49 is necessary for robust convergence. We point out though, that while 
the inexact methods need more iteration steps, they are much faster than exact 
inverses, since they use the Kronecker representation (7) of the approximate inverse. 
For instance, the setup cost is 3000 times higher, with a growing gap for higher 
polynomial degrees. 


Table 3 Fractional CG iterations with addditive cell-based Schwarz smoothers, exact as well as 
inexact local solution with varying damping factors ô 


Levels Convergence steps to 1078 

2D exact (@ = 0.7) |6=035 |@=042 |@=049 |6=056 |ô=0.63 
4 17.8 28.4 24.8 24.3 30.8 >100 

5 17.3 27.1 23.9 23.8 | 40.7 >100 

6 17.2 26.8 23.7 23.9 58.1 >100 

3D exact (ô = 0.7) |ô=0.35 |ô=0.42 |®=049 |ô=0.56 |ðð= 0.63 
2 20.6 31.8 28.5 25.8 25.0 28.5 

3 20.6 33.3 29.1 26.5 |274 74.8 

4 20.6 32.4 28.6 26.6 47.0 >100 


Two pre- and post-smoothing steps are used, respectively, and the polynomial degree is k = 4 
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5.3 Linear Elasticitiy 


In order to provide an outlook on how to apply this concept to more general 
problems, we consider linear elasticity, namely the Lamé-Navier equations, with 
the bilinear form 


a(u, v) = 2u (e(u), e(v)) + A(V-u, V - v). (9) 


Here, e(u) = (Vu + Vu') is the strain tensor of the displacement field u and (., -) 
denote the appropriate DG discretization with interior penalty terms. 

Consider a Cartesian vertex patch, that is, a patch with all faces aligned with 
the coordinate planes and with tensor product shape functions on each cell. As 
before, let M; be the one-dimensional mass matrix in direction k and L; the matrix 
representing the Laplacian including all face terms introduced by the interior penalty 
formulation. Furthermore, let G} be the matrix associated to the first derivative, 
again including the DG interface terms which arise in products of the form GI 8G. 
With these notions and the three-dimensional Laplacian 


L = M3 & M2 Q Lı + M3 Q L2 Q Mı + L3 & M2 Q Mı, (10) 


we can write the bilinear form a(., .) on the patch in matrix form 


L+M389M Lı M3®G}@G; Gl@MeG, 
Ap=u| M38&G29G! L+M389L:9Mı GlQG28Mı 
G3&M: QG!  G3Q&GI QM, L+L,8M:8QMı 


M3 ® M: ® Lı M3 ® G29 G! G39 M2. QG! 
+A] M3 8 G] ®G, M3 Q L2 Mı G3 9 G] Q Mı (11) 
GI @ M28 Gı G] 8 G28 Mı Lz ® M2 Q Mı 


Clearly, this matrix lacks the simple structure of Kronecker products we 
employed in the previous subsections. Nevertheless, we have Korn’s inequality [10], 
and thus the block diagonal of the left matrix is spectrally equivalent to the matrix 
itself. Consequently, we expect that 


- L + M3 & M2 8 Lı 
Ap =| L+M3@12@M, > 
L+ L38 M2 8 Mı 
(12) 


which has the desired Kronecker product structure, is a good local solver. Indeed, 
Table 4 confirms this expectation. Iteration counts remain almost constant over a 
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Table 4 Solver performance depending on level and polynomial degree k 


ss 39 bs ps ps pr pe F | 
s ps po E E e Ee E 


CG iterations to reduce the residual by 108 preconditioned by h-MG with multiplicative vertex 
patch smoother and approximate local solvers A>!. Only levels with 10* to 10’ degrees of freedom 
are shown. u = 1, à = 1 and the coarse grid consists of 2 x 2 cells 


Table 5 Solver performance depending on Lamé parameters u and A 


Levels (1.35) 
6 $200 
7 >20 
8 Bro þh pse jeo Joz [w 
9 $200 
io $200 
T $200 


CG iterations to reduce the residual by 108 preconditioned by h-MG with block-diagonal smoother. 
Shape functions of degree k = 4 are used. The coarse grid consists of 2 x 2 cells 


wide range of mesh levels and polynomial degrees. Comparing to Table 2, we lose 
less than a factor two, typically requiring 4 steps instead of 3. 

While Korn’s inequality helped us with the left matrix in (11), the matrix 
corresponding to the “grad-div” term in the Lamé—Navier equations has a nontrivial 
kernel and thus its inverse cannot be approximated by a block diagonal. We confirm 
this in Table 5. After augmenting A p by the diagonal terms of the grad-div matrix, 
we vary u and à. As expected, iteration counts increase when A > u to the point, 
where the method becomes infeasible. 

The case A >> p corresponds to an almost incompressible material. Thus, 
this behavior has to be addressed from two sides. First, the discretization must 
be suitable [34]. Then, the local solvers must be able to reduce the divergence 
sufficiently. Here, we have to find ways to implement a smoother like in [39] in an 
efficient way. Its structure prevents us from utilizing the tensor product techniques, 
namely the fast diagonalization method, used so far. 
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As an outlook, we describe a solution approach for two dimensions, which has 
been developed in a recent bachelor’s thesis [64]. The diagonal blocks of the matrix 
Ap are 


Ay = (2U+A)M2@Li+uULz@M,, A= uM:8Lı+(2u +4) L28Mı. (13) 


Both A; and A2 admit a fast diagonalization, for instance 
= si T 
Ay! = (Q28 Q1)(h @ Ai +4281) (Q2@Q1). (14) 


Given the off-diagonal block B = uGs ® G1 +AG2 Q GT, the Schur complement 
of A, is 
p 


S = Aı — B'A; 'B. (15) 
While this is not a sum of Kronecker products, Kronecker singular value decom- 


position (KSVD), see [72, 73], can be utilized to construct an approximation of the 
Schur complement which is fast diagonalizable. We proceed as follows: 


A.1 compute the fast diagonalizations of A; and A2 
A.2 compute the rank-o, KSVD of the inverse diagonal matrix in Eq. 14 


PA 
(8A +4281) 7x °C, @ D; (16) 
i=l 


A.3 compute the rank-2 KSVD 
S:= 2 @A+h@h~s (17) 


of the approximate Schur complement 


PA 
§:= Ai —B" be QC7'01 8 op;'01 | (18) 


i=l 


A.4 compute the fast diagonalization of Ss. 


Then, Gaussian block elimination provides an approximate inverse 


7 1—A;'B|[A;! 0 I 0 
arah E | elta 9 
P 0 l 0 S-!||-B'AȚ I 
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Implementation and evaluation of these smoothers are still work in progress, but 
the thesis [64] suggests fast and robust convergence at least in a finite difference 
context. 

The take-home message from this section is that an efficient approximate solution 
of the local problems in Schwarz smoothers is possible using low-rank tensor 
representations and can be achieved with effort similar to a matrix-free operator 
application in the best case. Finding such low-rank representations is nevertheless 
highly dependent on the differential equation and geometry. Further investigation 
will be directed in particular at dealing with the grad-div operator. 


6 High-performance Simulations of Incompressible Flows 


Computational fluid dynamics (CFD) simulations of turbulent flows at large 
Reynolds number, e.g., Re > 10°, are among those problems that typically require 
a huge amount of computational resources in order to resolve the turbulent flow 
structures in space and time, and have been addressed as an application by the 
ExaDG project. The underlying model problem is given by the incompressible 
Navier-Stokes equations 


0 
S+V- Uu) -vVu+Vp= f, (20) 


V-u=0. (21) 


Scale-resolving simulations for engineering applications typically involve beyond 
O(10!° — 10!!) unknowns (DoFs) and O(10° — 107) time steps. High-performance 
implementations for this type of problem are therefore of paramount importance for 
the CFD community. It is important to stress that implementing a given algorithm 
optimally for a given hardware, i.e., an implementation that performs close to 
the hardware limits, is only one step to achieve the goal of providing efficient 
flow solvers for engineering problems as emphasized in the introduction. While 
the previous sections discussed the second and third term in Eq.(1), namely the 
performance of matrix-free evaluation routines and fast multigrid solvers for high- 
order discretizations, we now also include discretization aspects into the discussion. 
The implementation makes use of the fast matrix-free evaluation routines and 
multigrid solvers discussed in previous sections. 

We use a method of lines approach with high-order DG discretizations in space 
and splitting methods with BDF time integration. Splitting methods separate the 
solution of the incompressible Navier-Stokes equations into sub-problems such 
as a Poisson equation for the pressure and a (convection—)diffusion equation 
for the velocity and are among the most efficient solvers currently known. In a 
first contribution [45], we highlighted that previous discretization methods lack 
robustness, on the one hand in the limit of small time step sizes, and on the other 
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hand in under-resolved scenarios where the spatial discretization only resolves the 
largest scales of the flow. The stability problem for small time step sizes has been 
addressed in detail in [21] where we found that a proper DG discretization of 
velocity-pressure coupling terms is essential to achieve robustness at small time 
steps. In [23], we presented the first high-order DG incompressible flow solver 
that is robust in the under-resolved regime and that relies completely on efficient 
matrix-free evaluation routines. The developed discretization approach is attractive 
as it provides a generic solver for turbulent flow simulations that is robust and 
accurate without the use of explicit turbulence models. Such a technique is known 
as implicit large-eddy simulation in the literature and has the advantage that it 
does not require turbulence model parameters. While this property of high-order 
DG discretizations is already known from discontinuous Galerkin discretizations 
of the compressible Navier-Stokes equations, the work [23] has been the first 
demonstrating this property for DG discretizations of the incompressible Navier— 
Stokes equations. The key ingredient for a robust high-order, L2-conforming 
DG discretization for incompressible flows turns out to be the use of consistent 
stabilization terms that enforce the divergence-free constraint and inter-element 
mass conservation in a weak sense. These requirements can also be included into 
the finite element function spaces by using so-called H (div)-conforming (normal- 
continuous) discretizations that are exactly (pointwise) divergence-free by using 
Raviart-Thomas elements. As investigated in detail in [26], such an approach has 
indeed very similar discretization properties when compared with the stabilized L2- 
conforming approach in practically relevant, under-resolved application scenarios. 
The model has been extended to moving meshes in [20]. 

A detailed performance analysis has been undertaken in [22] where we discuss 
the incompressible flow solver w.r.t. its efficiency according to Eq.(1). Based 
on this efficiency model, we have then compared matrix-free solvers based on 
incompressible and compressible Navier-Stokes formulations in [24] for under- 
resolved turbulent incompressible flows. The compressible solver uses explicit time 
integration and therefore only requires one operator evaluation in every Runge- 
Kutta stage as opposed to the incompressible solver involving the solution of 
linear system of equations such as a pressure Poisson equation within every 
time step. Simple explicit solvers are often considered efficient due to better 
parallel scalability since implicit Krylov solvers with multigrid preconditioning 
involve global communication. However, our work shows a significant performance 
advantage of the incompressible formulation over the compressible one on the node- 
level for sufficient workload. Albeit speed-up factors are higher, it is difficult to 
achieve a performance advantage for the algorithmically simple, explicit-in-time 
compressible solver in the strong-scaling limit in terms of absolute run time. In 
our experience, the potential to outperform an implicit solver at some point in the 
strong-scaling limit has not materialized. We see it as a future challenge to devise 
optimal PDE solvers providing good performance over a wide range of problems 
and hardware platforms due to this high degree of interdisciplinarity. 

We have applied this solver framework to conduct direct numerical simulations of 
turbulent channel flow in [45], the first direct numerical simulation of the turbulent 
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flow over a periodic hill at Re ~ 10* in [46], and to large-eddy simulation of 
the FDA benchmark nozzle problem in [25]. Furthermore, we have developed 
multiscale wall modeling approaches that allow to use the proposed highly efficient 
schemes also for industrial cases with even higher Reynolds numbers than what is 
feasible for wall-resolved large eddy simulation [47]. 

Here, we show performance results obtained on SuperMUC-NG with Intel 
Skylake CPUs. We study the three-dimensional Taylor—Green vortex problem as 
a standard benchmark to assess the accuracy and computational efficiency of 
incompressible turbulent flow solvers. Regarding discretization accuracy and from 
a physical point of view, the quantity of interest is the kinetic energy dissipation 
rate shown in Fig. 12 as a function of time 0 < ¢ < T = 20 for increasing 
Reynolds numbers Re = 100, 200, 400, 800, 1600, 3000, 10,000, oo. The first 
direct numerical simulation for the Re = 1600 case with a high-order DG scheme 
of the incompressible Navier-Stokes equations with a resolution of 1024? and 
polynomial degrees k = 3,7 has been shown in [22]. Here, we show results 
for effective resolutions up to 3072? (corresponding to 0.99 - 10!! DoFs) for the 
highest Reynolds number cases. Despite these fine resolutions, grid-converged 
results are achieved only up to Re = 3000. The inviscid problem (Re = oo) is 
most challenging, and the results in Fig. 12 suggest that even finer resolutions are 
required for grid-convergence, a goal that might be achievable in the foreseeable 
future. The largest problem with 0.99 - 10!! DoFs involved 6.6 - 104 time steps and 
required 11.4 h of wall time on 152,064 cores. In terms of degrees of freedom solved 
per time step per core, this results in a throughput of 1.05 MDoFs/s/core. 


-3 
20 X1? À : 


Fig. 12 Taylor-Green vortex: Kinetic energy dissipation rates for two different problem sizes (fine 
mesh as solid line and coarse mesh as dashed-dotted line) for each Re number: The polynomial 
degree is k = 3 and the effective resolutions Neff = (Nele, 1a(k + 1))3 considered are Neff = 
643, 1283 for Re = 100, New = 128°, 256° for Re = 200, 400, Mer = 256°, 5123 for Re = 
800, Neff = 10243, 2048? for Re = 1600, and Neg = 20483, 30723 for Re = 3000, 10000, oo 


ExaDG 217 


a 10! 
2 
a : 
oO 
E zo 10° 
— o 
= (a) 
z z 
F 2 io 
8 2 
T 
z 
10! 107? 
384 3072 25k 304k 48 384 3072 25k 304k 
Number of cores Number of cores 


—O— Neg = 1289 —E— Neg = 2562 —®— Neg = 512? —*— Now = 10243 
—— Neg = 20483 


realtime limit - - — linear scaling 


Fig. 13 Scaling analysis for incompressible flow solver on 3D Taylor—Green vortex with polyno- 
mial degree k = 3 at Re = 1600 and spatial resolutions of 128°, 256°, 512°, 10247, 20483 


Figure 13 shows strong scaling results for the TGV problem at Re = 1600 for 
effective resolutions of 1287, 256°, 1024, 20487 and polynomial degree k = 3. We 
assess strong scalability in terms of absolute run times for the whole application 
(including mesh-generation, setup of data structures, solvers, preconditioners, and 
postprocessing) rather than normalized speed-up factors as the aim of strong scala- 
bility is not only reducing but also minimizing time-to-solution, i.e., demonstrating 
strong-scalability of a code with poor serial performance is meaningless. The results 
in Fig. 13 reveal that we are able to perform the TGV simulations in realtime (twan < 
T = 20s) for spatial resolutions up to 1283. These numbers can be considered 
outstanding and we are not aware of other high-order DG solvers achieving this 
performance, see also the discussions in [22, 24]. The minimum wall time in the 
strong-scaling limit increases on finer meshes due to more time steps (the time step 
size is restricted according to the CFL condition, At ~ 1/h, for the mixed explicit— 
implicit splitting solver used here). For this reason, we also show strong scalability 
in terms of the wall time per time step, to allow extrapolations of how many time 
steps can be solved within a given wall time limit which is the typical use case for 
large-eddy and direct numerical simulations of turbulent flows. In this metric, the 
curves level off around 0.02 — 0.03 s of wall time per time step, independently of 
the spatial resolution. The SuperMUC-NG machine with 3 - 10° cores is too small 
to show the strong scaling limit for the largest problem size with 2048? resolution 
considered here. A parallel efficiency of 80.6% is achieved with a speed-up factor 
of 79.8 when scaling from 3072 cores to 304,128 cores. 
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7 hyper.deal: Extending the Matrix-Free Kernels to Higher 
Dimensions 


The matrix-free kernels developed within the ExaDG project have been imple- 
mented in a recursive manner which enables compilation with arbitrary spatial 
dimension. In order to be compatible with the mesh infrastructure of deal. which 
is restricted to dimensions up to 3, we have developed schemes working on a 
tensor product of two deal.II meshes. This allows extension to 2+2, 2+3, and 3+3 
dimensions. The corresponding framework is currently under development as the 
deal.II-extension hyper .deal [59]. 

The major application that we have in mind are kinetic problems in phase space 
where we use the tensor product of a spatial and a velocity mesh. However, other 
applications might arise such as parameter-dependent flow problems. Table 6 gives 
an overview of computational times on a six-dimensional Vlasov—Poisson problem, 
which involves an advection in the 6D space of the particle density in x and v space 
and the solution of a 3D Poisson equation for finding the electric potential that in 
turn specifies the electric field that transports the density field (cf. [44] for the same 
application tackled with a semi-Lagrangian solver). 

Figure 14 lists the throughput of the matrix-free evaluation of cell integrals for 
the multi-dimensional advection in three to six spatial dimensions for polynomial 
degrees k = 2,3,4,5 for AVX2 and AVX-512 vectorization over elements, 
respectively, without any application-specific tuning at this stage. While throughput 
is very good in 3D and 4D as well as 5D up to k = 4, performance drops 
significantly in 6D because the local arrays in sum factorization exhaust caches, 


Table 6 Contributions to run time on 6D Vlasov—Poisson system on 320 cores with 8.6 billion 
spatial DoFs over 42 time steps 


Category [6D advect total (of which MPI exchange) | integrate v [3D Poisson + electric field 


time [s] |560 (130) 13.0 35.9 
6 6 
É 4 É 4 
O O 
A A 
s s 
22 22 
5 5 
0 0 
3 4 > 6 3 4 5 6 
dimension d dimension d 


Fig. 14 Throughput of cell term for advection as a function of the spatial dimension on Intel 
Skylake with 4-wide vectorization (left) and 8-wide vectorization (right) 
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especially with AVX-512. Vectorization strategies within an element [50] are 
currently under development. 


8 Outlook 


Our work in the ExaDG project presented in this text has resulted in a highly 
competitive finite element framework. We have demonstrated excellent performance 
both for the pure operator evaluation, demonstrated e.g. by the CEED benchmark 
problems, as well as on an application level in computational fluid dynamics. We 
plan to engage in benchmarking also in the future to establish best-practices for the 
high-order finite element community. Furthermore, the evolving hardware landscape 
requires a continued effort, with increasing pressure to additional performance 
improvements on throughput architectures such as GPUs and FPGAs. In addition, 
we plan to extend our hybrid hp-multigrid framework to also handle hp-adaptive 
meshes. Finally, while the results from the Schwarz-based multigrid smoothers are 
very promising from a mathematical point of view, further steps are necessary to 
make them perform optimally on massively parallel hardware, and it is not yet clear 
how an optimal implementation compares in time-to-solution against the simpler 
Chebyshev-based ingredients we have considered on the large scale so far. 


References 


= 


. Alzetta, G., Arndt, D., Bangerth, W., Boddu, V., Brands, B., Davydov, D., Gassmoeller, R., 
Heister, T., Heltai, L., Kormann, K., Kronbichler, M., Maier, M., Pelteret, J.P., Turcksin, B., 
Wells, D.: The deal . II library, version 9.0. J. Numer. Math. 26(4), 173—184 (2018). https:// 
doi.org/10.1515/jnma-2018-0054 

2. Anderson, R., Barker, A., Bramwell, J., Cerveny, J., Dahm, J., Dobrev, V., Dudouit, Y., 
Fisher, A., Kolev, T., Stowell, M., Tomov, V.: MFEM: modular finite element methods (2019). 
mfem.org 

3. Antonietti, P.F., Sarti, M., Verani, M., Zikatanov, L.T.: A uniform additive Schwarz precon- 
ditioner for high-order discontinuous Galerkin approximations of elliptic problems. J. Sci. 
Comput. 70(2), 608—630 (2017). https://doi.org/10.1007/s10915-016-0259-9 

4. Arndt, D., Bangerth, W., Davydov, D., Heister, T., Heltai, L., Kronbichler, M., Maier, M., 
Pelteret, J.-P., Turcksin, B., Wells, D.: The deal.II finite element library: Design, features, and 
insights. Comput. Math. Appl. (2020). https://doi.org/10.1016/j.camwa.2020.02.022 

5. Bastian, P., Engwer, C., Fahlke, J., Geveler, M., Gdddeke, D., Iliev, O., Ippisch, O., Milk, R., 
Mohring, J., Miithing, S., Ohlberger, M., Ribbrock, D., Turek, S.: Hardware-based efficiency 
advances in the EXA-DUNE project. In: Bungartz, H.J., Neumann, P., Nagel, W.E. (eds.) 
Software for Exascale computing—SPPEXA 2013-2015, pp. 3-23. Springer, Cham (2016) 

6. Bastian, P., Müller, E.H., Miithing, S., Piatkowski, M.: Matrix-free multigrid block- 
preconditioners for higher order discontinuous Galerkin discretisations. J. Comput. Phys. 394, 
417-439 (2019). https://doi.org/10.1016/j.jcp.2019.06.001 

7. Bauer, S., Drzisga, D., Mohr, M., Rüde, U., Waluga, C., Wohlmuth, B.: A stencil scaling 

approach for accelerating matrix-free finite element implementations. SIAM J. Sci. Comput. 

40(6), C748-C778 (2018) 


220 D. Arndt et al. 


8. Bergen, B., Hülsemann, F., Rüde, U.: Is 1.7 x 10! unknowns the largest finite element system 
that can be solved today? In: Proceeding of ACM/IEEE Conference Supercomputing (SC’05), 
pp. 5:1-5:14 (2005). https://doi.org/10.1109/SC.2005.38 

9. Brandt, A.: Multi-level adaptive solutions to boundary-value problems. Math. Comput. 31, 
333-390 (1977). https://doi.org/10.1090/S0025-5718-1977-0431719-X 

10. Brenner, S.C.: Korn’s inequalities for piecewise H! vector fields. Math. Comput. 73(247), 
1067-1087 (2004) 

11. Brown, J.: Efficient nonlinear solvers for nodal high-order finite elements in 3D. J. Sci. 
Comput. 45(1-3), 48-63 (2010) 

12. Cantwell, C.D., Sherwin, S.J., Kirby, R.M., Kelly, P.H.J.: Form h to p efficiently: Selecting the 
optimal spectral/hp discretisation in three dimensions. Math. Model. Nat. Phenom. 6, 84—96 
(2011) 

13. Cantwell, C.D., Moxey, D., Comerford, A., Bolis, A., Rocco, G., Mengaldo, G., De Grazia, 
D., Yakovlev, S., Lombard, J.E., Ekelschot, D., Jordi, B., Xu, H., Mohamied, Y., Eskilsson, 
C., Nelson, B., Vos, P., Biotto, C., Kirby, R.M., Sherwin, S.J.: Nektar++: An open-source 
spectral/hp element framework. Comput. Phys. Comm. 192, 205-219 (2015). https://doi.org/ 
10.1016/j.cpc.2015.02.008 

14. Charrier, D.E., Hazelwood, B., Tutlyaeva, E., Bader, M., Dumbser, M., Kudryavtsev, A., 
Moskovsky, A., Weinzierl, T.: Studies on the energy and deep memory behaviour of a cache- 
oblivious, task-based hyperbolic PDE solver. Int. J. High Perf. Comput. Appl. 33(5), 973—986 
(2019). https://doi.org/10.1177/1094342019842645 

15. Clevenger, T.C., Heister, T., Kanschat, G., Kronbichler, M.: A flexible, parallel, adaptive 
geometric multigrid method for FEM. Technical report, arXiv:1904.03317 (2019) 

16. Davydov, D., Kronbichler, M.: Algorithms and data structures for matrix-free finite element 
operators with MPI-parallel sparse multi-vectors. ACM Trans. Parallel Comput. (2020). https:// 
doi.org/10.1145/3399736 

17. Davydov, D., Heister, T., Kronbichler, M., Steinmann, P.: Matrix-free locally adaptive finite 
element solution of density-functional theory with nonorthogonal orbitals and multigrid 
preconditioning. Phys. Status Solidi B: Basic Solid State Phys. 255(9), 1800069 (2018). https:// 
doi.org/10.1002/pssb.201800069 

18. Davydov, D., Pelteret, J.P., Arndt, D., Kronbichler, M., Steinmann, P.: A matrix-free approach 
for finite-strain hyperelastic problems using geometric multigrid. Int. J. Numer. Meth. Eng. 
(2020). https://doi.org/10.1002/nme.6336 

19. Deville, M.O., Fischer, P.F., Mund, E.H.: High-order Methods for Incompressible Fluid Flow, 
vol. 9. Cambridge University, Cambridge (2002) 

20. Fehn, N., Heinz, J., Wall, W.A., Kronbichler, M.: High-order arbitrary Lagrangian-Eulerian 
discontinuous Galerkin methods for the incompressible Navier-Stokes equations. Technical 
report, arXiv:2003.07166 (2020). 

21. Fehn, N., Wall, W.A., Kronbichler, M.: On the stability of projection methods for the incom- 
pressible Navier-Stokes equations based on high-order discontinuous Galerkin discretizations. 
J. Comput. Phys. 351, 392—421 (2017). https://doi.org/10.1016/j.jcp.2017.09.031 

22. Fehn, N., Wall, W.A., Kronbichler, M.: Efficiency of high-performance discontinuous Galerkin 
spectral element methods for under-resolved turbulent incompressible flows. Int. J. Numer. 
Meth. Fluids 88(1), 32-54 (2018). https://doi.org/10.1002/fld.4511 

23. Fehn, N., Wall, W.A., Kronbichler, M.: Robust and efficient discontinuous Galerkin methods 
for under-resolved turbulent incompressible flows. J. Comput. Phys. 372, 667-693 (2018). 
https://doi.org/10.1016/j.jcp.2018.06.037 

24. Fehn, N., Wall, W.A., Kronbichler, M.: A matrix-free high-order discontinuous Galerkin com- 
pressible Navier-Stokes solver: a performance comparison of compressible and incompressible 
formulations for turbulent incompressible flows. Int. J. Numer. Meth. Fluids 89(3), 71—102 
(2019). https://doi.org/10.1002/fld.4683 


ExaDG 221 


25; 


26. 


27. 


28. 


29. 


30. 


31. 


32. 


33. 


34. 


35. 


36. 


37. 


38. 


39. 


40. 


41. 


42. 


Fehn, N., Wall, W.A., Kronbichler, M.: Modern discontinuous Galerkin methods for the 
simulation of transitional and turbulent flows in biomedical engineering: a comprehensive LES 
study of the FDA benchmark nozzle model. Int. J. Numer. Meth. Biomed. Eng. 35(12), e3228 
(2019). https://doi.org/10.1002/cnm.3228 

Fehn, N., Kronbichler, M., Lehrenfeld, C., Lube, G., Schroeder, P.W.: High-order DG solvers 
for under-resolved turbulent incompressible flows: a comparison of L? and H (div) methods. 
Int. J. Numer. Meth. Fluids 91(11), 533-556 (2019). https://doi.org/10.1002/fld.4763 

Fehn, N., Munch, P., Wall, W.A., Kronbichler, M.: Hybrid multigrid methods for high-order 
discontinuous Galerkin discretizations. J. Comput. Phys. (2020). https://doi.org/10.1016/j.jcp. 
2020. 109538 

Fischer, P., Kerkemeier, S., Peplinski, A., Shaver, D., Tomboulides, A., Min, M., Obabko, A., 
Merzari, E.: Nek5000 Web page (2019). https://nek5000.mcs.anl.gov 

Fischer, P., Min, M., Rathnayake, T., Dutta, S., Kolev, T., Dobrev, V., Camier, J.S., Kronbichler, 
M., Warburton, T., Swirydowicz, K., Brown, J.: Scalability of high-performance PDE solvers. 
Int. J. High Perf. Comput. Appl. (2020). https://doi.org/10.1177/10943420209 15762 
Gholami, A., Malhotra, D., Sundar, H., Biros, G.: FFT, FMM, or multigrid? A comparative 
study of state-of-the-art Poisson solvers for uniform and nonuniform grids in the unit cube. 
SIAM J. Sci. Comput. 38(3), C280—C306 (2016). https://doi.org/10.1137/15M1010798 
Gmeiner, B., Rüde, U., Stengel, H., Waluga, C., Wohlmuth, B.: Towards textbook efficiency 
for parallel multigrid. Numer. Math.-Theory Me. Appl. 8(1), 22—46 (2015) 

Gmeiner, B., Huber, M., John, L., Rüde, U., Wohlmuth, B.: A quantitative perfor- 
mance study for Stokes solvers at the extreme scale. J. Comput. Sci. 17, 509- 
521 (2016). https://doi.org/10.1016/j.jocs.2016.06.006. http://www.sciencedirect.com/science/ 
article/pii/S1877750316301077. Recent Advances in Parallel Techniques for Scientific Com- 
puting 

Hager, G., Wellein, G.: Introduction to High Performance Computing for Scientists and 
Engineers. CRC Press, Boca Raton (2011) 

Hansbo, P., Larson, M.G.: Discontinuous Galerkin methods for incompressible and nearly 
incompressible elasticity by Nitsche’s method. Comput. Methods Appl. Mech. Eng. 191, 1895- 
1908 (2002) 

Ibeid, H., Olson, L., Gropp, W.: FFT, FMM, and multigrid on the road to exascale: performance 
challenges and opportunities. J. Parallel Distrib. Comput. 136, 63—74 (2020). https://doi.org/ 
10.1016/j.jpdc.2019.09.014 

Janssen, B., Kanschat, G.: Adaptive multilevel methods with local smoothing for H !_ and 


your -conforming high order finite element methods. SIAM J. Sci. Comput. 33(4), 2095-2114 
(2011). https://doi.org/10.1137/090778523 

Kanschat, G.: Multi-level methods for discontinuous Galerkin FEM on locally refined meshes. 
Comput. Struct. 82(28), 2437-2445 (2004). https://doi.org/10.1016/j.compstruc.2004.04.015 
Kanschat, G.: Robust smoothers for high order discontinuous Galerkin discretizations of 
advection-diffusion problems. J. Comput. Appl. Math. 218, 53-60 (2008). https://doi.org/10. 
1016/j.cam.2007.04.032 

Kanschat, G., Mao, Y.: Multigrid methods for HİV -conforming discontinuous Galerkin 
methods for the Stokes equations. J. Numer. Math. 23(1), 51—66 (2015). https://doi.org/10. 
1515/jnma-2015-0005 

Kempf, D., Hess, R., Müthing, S., Bastian, P.: Automatic code generation for high-performance 
discontinuous Galerkin methods on modern architectures. Technical report, arXiv:1812.08075 
(2018) 

Knepley, M.G., Brown, J., Rupp, K., Smith, B.F.: Achieving high performance with unified 
residual evaluation. Technical report, arXiv:1309.1204 (2013) 

Kormann, K.: A time-space adaptive method for the Schrödinger equation. Commun. Comput. 
Phys. 20(1), 60-85 (2016). https://doi.org/10.4208/cicp.101214.021015a 


222 D. Arndt et al. 


43. Kormann, K., Kronbichler, M.: Parallel finite element operator application: graph partitioning 
and coloring. In: Proceeding of 7th IEEE International Conference eScience, pp. 332-339 
(2011). https://10.1109/eScience.2011.53 

44. Kormann, K., Reuter, K., Rampp, M.: A massively parallel semi-Lagrangian solver for the six- 
dimensional Vlasov—Poisson equation. Int. J. High Perform. Comput. Appl. 33(5), 924-947 
(2019). https://doi.org/10.1177/1094342019834644 

45. Krank, B., Fehn, N., Wall, W.A., Kronbichler, M.: A high-order semi-explicit discontinuous 
Galerkin solver for 3D incompressible flow with application to DNS and LES of turbulent 
channel flow. J. Comput. Phys. 348, 634-659 (2017). https://doi.org/10.1016/j.jcep.2017.07. 
039 

46. Krank, B., Kronbichler, M., Wall, W.A.: Direct numerical simulation of flow over periodic hills 
up to Re, = 10,595. Flow Turbulence Combust. 101, 521-551 (2018). https://doi.org/10.1007/ 
$10494-018-9941-3 

47. Krank, B., Kronbichler, M., Wall, W.A.: A multiscale approach to hybrid RANS/LES wall 
modeling within a high-order discontinuous Galerkin scheme using function enrichment. Int. 
J. Numer. Meth. Fluids 90, 81-113 (2019). https://doi.org/10.1002/fld.47 12 

48. Kronbichler, M., Allalen, M.: Efficient high-order discontinuous Galerkin finite elements with 
matrix-free implementations. In: Bungartz, H.J., Kranzlmiiller, D., Weinberg, V., Weismiiller, 
J., Wohlgemuth, V. (eds.) Advances and New Trends in Environmental Informatics, pp. 89- 
110. Springer, Berlin (2018). https://doi.org/10.1007/978-3-319-99654-7_7 

49. Kronbichler, M., Kormann, K.: A generic interface for parallel cell-based finite element 
operator application. Comput. Fluids 63, 135-147 (2012). https://doi.org/10.1016/j.compfiuid. 
2012.04.012 

50. Kronbichler, M., Kormann, K.: Fast matrix-free evaluation of discontinuous Galerkin finite 
element operators. ACM Trans. Math. Softw. 45(3), 29:1-29:40 (2019). https://doi.org/10. 
1145/3325864 

51. Kronbichler, M., Ljungkvist, K.: Multigrid for matrix-free high-order finite element computa- 
tions on graphics processors. ACM Trans. Parallel Comput. 6(1), 2:1—2:32 (2019). https://doi. 
org/10.1145/3322813 

52. Kronbichler, M., Wall, W.A.: A performance comparison of continuous and discontinuous 
Galerkin methods with fast multigrid solvers. SIAM J. Sci. Comput. 40(5), A3423-A3448 
(2018). https://doi.org/10.1137/16M110455X 

53. Kronbichler, M., Schoeder, S., Miiller, C., Wall, W.A.: Comparison of implicit and explicit 
hybridizable discontinuous Galerkin methods for the acoustic wave equation. Int. J. Numer. 
Meth. Eng. 106(9), 712-739 (2016). https://doi.org/10.1002/nme.5 137 

54. Kronbichler, M., Kormann, K., Pasichnyk, I., Allalen, M.: Fast matrix-free discontinuous 
Galerkin kernels on modern computer architectures. In: Kunkel, J.M., Yokota, R., Balaji, P., 
Keyes, D.E. (eds.) ISC High Performance 2017, LNCS 10266, pp. 237-255 (2017). https:// 
doi.org/10.1007/978-3-319-58667-013 

55. Kronbichler, M., Diagne, A., Holmgren, H.: A fast massively parallel two-phase flow solver for 
microfluidic chip simulation. Int. J. High Perf. Comput. Appl. 32(2), 266-287 (2018). https:// 
doi.org/10.1177/1094342016671790 

56. Kronbichler, M., Kormann, K., Fehn, N., Munch, P., Witte, J.: A Hermite-like basis for faster 
matrix-free evaluation of interior penalty discontinuous Galerkin operators. Technical report, 
arXiv:1907.08492 (2019) 

57. Ljungkvist, K.: Matrix-free finite-element computations on graphics processors with adap- 
tively refined unstructured meshes. In: Proceedings of the 25th High Performance Computing 
Symposium, HPC °17, pp. 1:1—1:12. Society for Computer Simulation International, San Diego 
(2017). http://dl.acm.org/citation.cfm?id=3 108096.3 108097 

58. Lynch, R.E., Rice, J.R., Thomas, D.H.: Direct solution of partial difference equations by tensor 
product methods. Numer. Math. 6, 185-199 (1964). https://doi.org/10.1007/BF01386067 

59. Munch, P., Kormann, K., Kronbichler, M.: hyper.deal: An efficient, matrix-free finite-element 
library for high-dimensional partial differential equations. Technical report, arXiv:2002.08110 
(2020) 


ExaDG 223 


60. 


61. 


62. 


63. 


64. 


65. 


66. 


67. 


68. 


69. 


70. 


71. 


72. 
T3; 
74. 
75. 


76. 


Müthing, S., Piatkowski, M., Bastian, P.: High-performance implementation of matrix-free 
high-order discontinuous Galerkin methods. Technical report, arXiv:1711.10885 (2017) 
Orszag, S.A.: Spectral methods for problems in complex geometries. J. Comput. Phys. 37, 
70-92 (1980) 

Raffenetti, K., Amer, A., Oden, L., Archer, C., Bland, W., Fujita, H., Guo, Y., Janjusic, T., 
Durnov, D., Blocksome, M., Si, M., Seo, S., Langer, A., Zheng, G., Takagi, M., Coffman, P., 
Jose, J., Sur, S., Sannikov, A., Oblomov, S., Chuvelev, M., Hatanaka, M., Zhao, X., Fischer, 
P., Rathnayake, T., Otten, M., Min, M., Balaji, P.: Why is MPI so slow?: Analyzing the 
fundamental limits in implementing MPI-3.1. In: Proceedings of the International Conference 
for High Performance Computing, Networking, Storage and Analysis, SC ’17, pp. 62:1-62:12. 
ACM, New York (2017). https://doi.org/10.1145/3126908.3 126963 

Rathgeber, F., Ham, D.A., Mitchell, L., Lange, M., Luporini, F., McRae, A.T.T., Bercea, G.T., 
Markall, G.R., Kelly, P.H.J.: Firedrake: automating the finite element method by composing 
abstractions. ACM Trans. Math. Soft. 43(3), 24:1-24:27 (2017). https://doi.org/10.1145/ 
2998441 

Schmidt, S.: Fast, tensor-based solution of problems involving incompressibility, Bachelor 
thesis. Heidelberg University, Heidelberg (2019) 

Schoeder, S., Kormann, K., Wall, W.A., Kronbichler, M.: Efficient explicit time stepping of 
high order discontinuous Galerkin schemes for waves. SIAM J. Sci. Comput. 40(6), C803- 
C826 (2018). https://doi.org/10.1137/18M1185399 

Schoeder, S., Kronbichler, M., Wall, W.: Arbitrary high-order explicit hybridizable discontin- 
uous Galerkin methods for the acoustic wave equation. J. Sci. Comput. 76, 969-1006 (2018). 
https://doi.org/10.1007/s10915-018-0649-2 

Schoeder, S., Sticko, S., Kreiss, G., Kronbichler, M.: High-order cut discontinuous Galerkin 
methods with local time stepping for acoustics. Int. J. Numer. Meth. Eng. (2020). https://doi. 
org/10.1002/nme.6343 

Schoeder, S., Wall, W.A., Kronbichler, M.: ExWave: A high performance discontinuous 
Galerkin solver for the acoustic wave equation. Soft. X 9, 49-54 (2019). https://doi.org/10. 
1016/j.softx.2019.01.001 

Solomonoff, A.: A fast algorithm for spectral differentiation. J. Comput. Phys. 98(1), 174-177 
(1992). https://doi.org/10.1016/0021-9991(92)90182-X 

Sundar, H., Biros, G., Burstedde, C., Rudi, J., Ghattas, O., Stadler, G.: Parallel geometric- 
algebraic multigrid on unstructured forests of octrees. In: Proceedings of the International 
Conference on High Performance Computing, Networking, Storage and Analysis, p. 43. IEEE 
Computer Society, Silver Spring (2012) 

Świrydowicz, K., Chalmers, N., Karakus, A., Warburton, T.: Acceleration of tensor-product 
operations for high-order finite element methods. Int. J. High Perf. Comput. Appl. 33(4), 735- 
757 (2019). https://doi.org/10.1177/1094342018816368 

Van Loan, C.F.: The ubiquitous Kronecker product. J. Comput. Appl. Math. 123(1-2), 85—100 
(2000) 

Van Loan, C.F., Pitsianis, N.: Approximation with Kronecker products. In: Linear Algebra for 
Large Scale and Real-time Applications, pp. 293-314. Springer, Berlin (1993) 

Varga, R.S.: Matrix Iterative Analysis, 2nd edn. Springer, Berlin (2009) 

Wichmann, K.R., Kronbichler, M., Lohner, R., Wall, W.A.: Practical applicability of optimiza- 
tions and performance models to complex stencil-based loop kernels in CFD. Int. J. High Perf. 
Comput. Appl. 33(4), 602-618 (2019). https://doi.org/10.1177/1094342018774126 

Witte, J., Arndt, D., Kanschat, G.: Fast tensor product Schwarz smoothers for high-order 
discontinuous Galerkin methods. Technical report, arXiv:1910.11239 (2019) 


224 D. Arndt et al. 


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons licence and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons licence, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons licence and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Exa-Dune—Flexible PDE Solvers, A 
Numerical Methods and Applications oe 


Peter Bastian, Mirco Altenbernd, Nils-Arne Dreier, Christian Engwer, 
Jorrit Fahlke, René Fritze, Markus Geveler, Dominik Géddeke, Oleg Iliev, 
Olaf Ippisch, Jan Mohring, Steffen Miithing, Mario Ohlberger, 

Dirk Ribbrock, Nikolay Shegunov, and Stefan Turek 


Abstract In the EXA-DUNE project we have developed, implemented and opti- 
mised numerical algorithms and software for the scalable solution of partial 
differential equations (PDEs) on future exascale systems exhibiting a heterogeneous 
massively parallel architecture. In order to cope with the increased probability 
of hardware failures, one aim of the project was to add flexible, application- 
oriented resilience capabilities into the framework. Continuous improvement of the 
underlying hardware-oriented numerical methods have included GPU-based sparse 
approximate inverses, matrix-free sum-factorisation for high-order discontinuous 
Galerkin discretisations as well as partially matrix-free preconditioners. On top 
of that, additional scalability is facilitated by exploiting massive coarse grained 
parallelism offered by multiscale and uncertainty quantification methods where we 
have focused on the adaptive choice of the coarse/fine scale and the overlap region as 
well as the combination of local reduced basis multiscale methods and the multilevel 
Monte-Carlo algorithm. Finally, some of the concepts are applied in a land-surface 
model including subsurface flow and surface runoff. 


P. Bastian (4) - S. Miithing 
Heidelberg University, Heidelberg, Germany 
e-mail: peter.bastian @iwr.uni-heidelberg.de 


N.-A. Dreier - C. Engwer - J. Fahlke - R. Fritze - M. Ohlberger 
University of Miinster, Miinster, Germany 


M. Altenbernd - D. Géddeke 
University of Stuttgart, Stuttgart, Germany 


M. Geveler - D. Ribbrock - S. Turek 
TU Dortmund, Dortmund, Germany 


O. Iliev - J. Mohring - N. Shegunov 
Fraunhofer ITWM, Kaiserslautern, Germany 


O. Ippisch 
TU Clausthal-Zellerfeld, Clausthal-Zellerfeld, Germany 


© The Author(s) 2020 225 
H.-J. Bungartz et al. (eds.), Software for Exascale Computing - SPPEXA 

2016-2019, Lecture Notes in Computational Science and Engineering 136, 
https://doi.org/10.1007/978-3-030-47956-5_9 


226 P. Bastian et al. 
1 Introduction 


In the EXA-DUNE project we extend the Distributed and Unified Numerics Envi- 
ronment (DUNE)!  [6, 7] by hardware-oriented numerical methods and hardware- 
aware implementation techniques developed in the (now) FEAT3* [55] project to 
provide an exascale-ready software framework for the numerical solution of a large 
variety of partial differential equation (PDE) systems with state-of-the-art numerical 
methods including higher-order discretisation schemes, multi-level iterative solvers, 
unstructured and locally-refined meshes, multiscale methods and uncertainty quan- 
tification, while achieving close-to-peak performance and exploiting the underlying 
hardware. 

In the first funding period we concentrated on the node-level performance as the 
framework and in particular its algebraic multigrid solver already show very good 
scalability in MPI-only mode as documented by the inclusion of DUNE’s solver 
library in the High-Q-Club, the codes scaling to the full machine in Jiilich at the 
time, with close to half a million cores. Improving the node-level performance 
in light of future exascale hardware involved multithreading (“MPI+X”) and in 
particular exploiting SIMD parallelism (vector extensions of modern CPUs and 
accelerator architectures). These aspects were addressed within the finite element 
assembly and iterative solution phases. Matrix-free methods evaluate the discrete 
operator without storing a matrix, as the name implies, and promise to be able to 
achieve a substantial fraction of peak performance. Matrix-based approaches on the 
other hand are limited by memory bandwidth (at least) in the solution phase and 
thus typically exhibit only a small fraction of the peak (GFLOP/s) performance of 
a node, but decades of research have led to robust and efficient (in terms of number 
of iterations) iterative linear solvers for practically relevant systems. Importantly, a 
consideration of matrix-free and matrix-based methods needs to take the order of 
the method into account. For low-order methods it is imperative that a matrix entry 
can be recomputed in less time than it takes to read it from memory, to counteract 
the memory wall problem. This requires to exploit the problem structure as much 
as possible, i.e., to rely on constant coefficients, (locally) regular mesh structure and 
linear element transformations [28, 37]. In these cases it is even possible to apply 
stencil type techniques, like developed in the EXA-STENCIL project [40]. On the 
other hand, for high-order methods with tensor-product structure the complexity 
of matrix-free operator evaluation can be much less than that of matrix-vector 
multiplication, meaning that less floating-point operations have to be performed 
which at the same time can be executed at a higher rate due to reduced memory 
pressure and better suitability for vectorization [12, 39, 50]. This makes high-order 
methods extremely attractive for exascale machines [48, 51]. 


"http://www.dune-project.org/. 
*http://feast.tu-dortmund.de/. 
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In the second funding phase we have mostly concentrated on the following 
aspects: 


1. Asynchronicity and fault tolerance: High-level C++ abstractions form the 
basis of transparent error handling using exceptions in a parallel environment, 
fault-tolerant multigrid solvers as well as communication hiding Krylov methods. 

2. Hardware-aware solvers for PDEs: We investigated matrix-based sparse- 
approximate inverse preconditioners including novel machine-learning 
approaches, vectorization through multiple right-hand sides as well as matrix- 
free high-order Discontinous Galerkin (DG) methods and partially matrix-free 
robust preconditioners based on algebraic multigrid (AMG). 

3. Multiscale (MS) and uncertainty quantification (UQ) methods: These meth- 
ods provide an additional layer of embarrassingly parallel tasks on top of the 
efficiently parallelized forward solvers. A challenge here is load balancing of the 
asynchronous tasks which has been investigated in the context of the localized 
reduced basis multiscale method and multilevel Monte Carlo methods. 

4. Applications: We have considered large-scale water transport in the subsurface 
coupled to surface flow as an application where the discretization and solver 
components can be applied. 


In the community, there is broad consensus on the assumptions about exascale 
systems that did not change much during the course of this 6 year project. A 
report by the Exascale Mathematics Working Group to the U.S. Department of 
Energy’s Advanced Scientific Computing Research Program [16] summarises these 
challenges as follows, in line with [35] and more recently the Exascale Computing 
Project:> (1) The anticipated power envelope of 20 MW implies strong limitations 
on the amount and organisation of the hardware components, an even stronger 
necessity to fully exploit them, and eventually even power-awareness in algorithms 
and software. (2) The main performance difference from peta- to exascale will 
be through a 100-1000 fold increase in parallelism at the node level, leading to 
extreme levels of concurrency and increasing heterogeneity through specialised 
accelerator cores and wide vector instructions. (3) The amount of memory per 
‘core’ and the memory and interconnect bandwidth/latency will only increase at 
a much smaller rate, hence increasing the demand for lower memory footprints and 
higher data locality. (4) Finally, hardware failures, and thus the mean-time-between- 
failure (MTBF), were expected to increase proportionally (or worse) corresponding 
to the increasing number of components. Recent studies have indeed confirmed 
this expectation [30], although not at the projected rate. First exascale systems are 
scheduled for 2020 in China [42], 2021 in the US and 2023 [25] in Europe. Although 
the details are not yet fully disclosed, it seems that the number of nodes will not be 
larger than 10° and will thus remain in the range of previous machines such as the 


3https://www.exascaleproject.org/. 
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BlueGene. The major challenge will thus be to exploit the node level performance 
of more than 10 TFLOP/s. 

The rest of this paper is organized as follows. In Sect. 2 we lay the foundations of 
asynchronicity and resilience, while Sect. 3 discusses several aspects of hardware- 
aware and scalable iterative linear solvers. These building blocks will then be used in 
Sects. 4 and 5 to drive localized reduced basis and multilevel Monte-Carlo methods. 
Finally, Sect. 6 covers our surface-subsurface flow application. 


2 Asynchronicity and Fault Tolerance 


As predicted in the first funding period, latency has indeed become a major issue, 
both within a single node as well as between different MPI ranks. The core concept 
underlying all latency- and communication-hiding techniques is asynchronicity. 
This is also crucial to efficiently implement certain local-failure local-recovery 
methods. Following the DUNE philosophy, we have designed a generic layer that 
abstracts the use of asynchronicity in MPI from the user. In the following, we first 
describe this layer and its implementation, followed by representative examples 
on how to build middleware infrastructure on it, and on its use for s-step Krylov 
methods and fault tolerance beyond global checkpoint-restart techniques. 


2.1 Abstract Layer for Asynchronicity 


We first introduce a general abstraction for asynchronicity in parallel MPI appli- 
cations, which we developed for DUNE. While we integrated these abstractions 
with the DUNE framework, most of the code can easily be imported into other 
applications, and is available as a standalone library. 

The C++ API for MPI was dropped from MPI-3 since it offered no real 
advantage over the C bindings, beyond being a simple wrapper layer. Most MPI 
users coding in C++ are still using the C bindings, writing their own C++ 
interface/layer, in particular in more generic software frameworks. At the same time 
the C++11 standard introduced high-level concurrency concepts, in particular the 
future/promise construct to enable an asynchronous program flow while maintaining 
value semantics. We adopt this approach as a first principle in our MPI layer to 
handle asynchronous MPI operations and propose a high-level C++ MPI interface, 
which we provide in DUNE under the generic interface of Dune: : Communication 
and a specific implementation Dune : :MPICommunication. 

An additional issue of the concrete MPI library in conjunction with C++ is the 
error handling concept. In C++, exceptions are the advocated approach to handle 
error propagation. As exceptions change the local code path on the, e.g., failing 
process in a hard fault scenario, exceptions can easily lead to a deadlock. As we 
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discuss later, the introduction of our asynchronous abstraction layer enables global 
error handling in an exception friendly manner. 

In concurrent environments a C++ future decouples values from the actual 
computation (promise). The program flow can continue while a thread is computing 
the actual result and promotes this via promise to the future. The MPI C and Fortran 
interfaces offer asynchronous operations, but in contrast to thread parallel, the user 
does not specify the operation within the concurrent operation. Actually, MPI on 
its own does not offer any real concurrency at all, and provides instead a handle- 
based programming interface to avoid certain cases of deadlocks: the control flow is 
allowed to continue without finishing the communication, while the communication 
usually only proceeds when calls into the MPI library are executed. 

We developed a C++ layer on top of the asynchronous MPI operations, which 
follows the design of the C++11 future. Note that the actual std: : future class 
cannot be used for this purpose. 


template<typename T> 
class Future { 
void wait(); 
bool ready() const; 
bool valid() const; 
T get(); 
he 


As different implementations like thread-based std: : future, task-based TBB 
::future, and our new MPIFuture are available, usability greatly benefits from 
a dynamically typed interface. This is a reasonable approach, as std: : future 
is using a dynamical interface already and also the MPI operations are coarse 
grained, so that the additional overhead of virtual function calls is negligible. At the 
same time the user expects a future to offer value semantics, which contradicts the 
usual pointer semantics used for dynamic polymorphism. In EXA-DUNE we decided 
to implement type-erasure to offer a clean and still flexible user interface. An 
MPIFuture is responsible for handling all states associated with an MPI operation. 


class MPIFuture{ 

private: 
mutable MPI Request req_; 
mutable MPI Status status; 
impl::Buffer<R> data_j; 
impl::Buffer<S> send _data_j; 

public: 


ee 


The future holds a mutable MPI_Request and MPI_Status to access information 
on the current operation and it holds buffer objects, which manage the actual data. 
These buffers offer a great additional value, as we do not access the raw data 
directly, but can include data transformation and varying ownership. For example 
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it is now possible to directly send an std: : vector<double>, where the receiver 
automatically resizes the std: : vector according to the incoming data stream. 
This abstraction layer enables different use cases, highlighted below: 


1. Parallel C++ exception handling: Exceptions are the recommended way to 
handle faults in C++ programs. As exceptions alter the execution path of a single 
node, they are not suitable for parallel programs. As asynchronicity allows for 
moderately diverging execution paths, we can use it to implement parallel error 
propagation using exceptions. 

2. Solvers and preconditioners tolerant to hard and soft faults: This functional- 
ity is used for failure propagation, restoration of MPI in case of a hard fault, and 
asynchronous in-memory checkpointing. 

3. Asynchronous Krylov solvers: Scalar products in Krylov methods require 
global communication. Asynchronicity can be used to hide the latency and 
improve strong scalability. 

4. Asynchronous parallel IO: The layer allows to transform any non-blocking 
MPI operation into a really asynchronous operation. This allows also to support 
asynchronous IO, to hide the latency of write operations and overlap with the 
computation of the next iteration or time step. 

5. Parallel localized reduced basis methods: Asynchronicity will be used to 
mitigate the load-imbalance inherent in the error estimator guided adaptive online 
enrichment of local reduced bases. 


2.2 Parallel C++ Exception Handling 


In parallel numerical algorithms, unexpected behaviour can occur quite frequently: 
a solver could diverge, the input of a component (e.g., the mesher) could be 
inappropriate for another component (e.g., the discretiser), etc. A well-written code 
should detect unexpected behaviour and provide users with a possibility to react 
appropriately in their own programs, instead of simply terminating with some error 
code. For C++, exceptions are the recommended method to handle this. With well 
placed exceptions and corresponding try-catch blocks, it is possible to accomplish 
a more robust program behaviour. However, the current MPI specification [44] 
does not define any way to propagate exceptions from one rank (process) to 
another. In the case of unexpected behaviour within the MPI layer itself, MPI 
programs simply terminate, maybe after a time-out. This is a design decision that 
unfortunately implies a severe disadvantage in C++, when combined with the 
ideally asynchronous progress of computation and communication: an exception 
that is thrown locally by some rank can currently lead to a communication deadlock, 
or ultimately even to undesired program termination. Even though exceptions are 
technically an illegal use of the MPI standard (a peer no longer participates in a 
communication), it undesirably conflicts with the C++ concept of error handling. 
Building on top of the asynchronicity layer, we have developed an approach to 
enable parallel C++ exceptions. We follow C+-+11 techniques, e.g., use future-like 
abstractions to handle asynchronous communication. Our currently implemented 
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interface requires ULFM [11], an MPI extension to restore communicators after 
rank losses, which is scheduled for inclusion into MPI-4. We also provide a fallback 
solution for non-ULFM MPI installations, that employs an additional communicator 
for propagation and can, by construction, not handle hard faults, i.e., the loss of a 
node resulting in the loss of rank(s) in some communicator. 

To detect exceptions in the code we have extended the Dune: :MPIGuard, 
that previously only implemented the scope guard concept to detect and react 
on local exceptions. Our extension revokes the MPI communicator using the 
ULFM functionality if an exception is detected, so that it is now possible to use 
communication inside a block with scope guard. This makes it superfluous to call 
the finalize and reactivate methods of the MpIGuard before and after each 
communication. 


try{ 
MPIGuard guard (comm) ; 


do_something () ; 

communicate (comm) ; 
WeENeelsl CH 

comm. shrink () ; 

recover (comm) ; 


} 


Listing 1 MPIGuard 


Listing 1 shows an example how to use the MPIGuard and recover the communicator 
in a node loss scenario. In this example, an exception that is thrown only on a 
few ranks in do_something() will not lead to a deadlock, since the MPIGuard 
would revoke the communicator. Details of the implementation and further 
descriptions are available in a previous publication [18]. We provide the “black- 
channel” fallback implementation as a standalone version.* This library uses 
the P-interface of the MPI standard, which makes it possible to redefine MPI 
functions. At the initialization of the MPI setting the library creates an opaque 
communicator, called blackchannel, on which a pending MPI_Irecv request is 
waiting. Once a communicator is revoked, the revoking rank sends messages to 
the pending blackchannel request. To avoid deadlocks, we use MPI_Waitany to 
wait for a request, which listens also for the blackchannel request. All blocking 
communication is redirected to non-blocking calls using the P-interface. The library 
is linked via LD_PRELOAD which makes it usable without recompilation and could 
be removed easily once a proper ULFM implementation is available in MPI. 
Figure | shows a benchmark comparing the time which is used for duplicating a 
communicator, revoking it and restore a valid state. The benchmark was performed 
on PALMA2, the HPC cluster of the University of Muenster. Three implementations 
are compared; OpenMPI_BC and IntelMPI_BC are using the blackchannel library 
based on OpenMPI and IntelMPI, respectively. OpenMPI_ULFM uses the ULFM 


*https://gitlab.dune-project.org/exadune/blackchannel-ulfm. 
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Fig. 1 Benchmark of different MPI implementations: 12 nodes with 36 processes (left), 48 nodes 
with 36 processes (right), cf. [18] 


implementation provided by fault-tolerance.org, which is based on OpenMPI. 
We performed 100 measurements for each implementation. The blackchannel 
implementation is competitive to the ULFM implementation. As OpenMPI is in 
this configuration not optimized and does not use the RDMA capabilities of the 
interconnect, it is slower than the IntelMPI implementation. The speed up of the 
OpenMPI_ULFM version compared to the OpenMPI_BC version is due to the better 
communication strategy. 


2.3 Compressed in-Memory Checkpointing for Linear Solvers 


The previously described parallel exception propagation, rank loss detection and 
communicator restoration by using the ULFM extension, allow us to implement a 
flexible in-memory checkpointing technique which has the potential to recover from 
hard faults on-the-fly without any user interaction. Our implementation establishes a 
backup and recovery strategy which in part is based on a local-failure local-recovery 
(LFLR) [54] approach, and involves lossy compression techniques to reduce the 
memory footprint as well as bandwidth pressure. The contents of this subsection 
have not been published previously. 


Modified Solver Interface To enable the use of exception propagation as 
illustrated in the previous section and to implement different backup recovery 
approaches we kept all necessary modifications to DUNE-ISTL, the linear solver 
library. We embed the solver initialisation and the iterative loop in a try-catch block, 
and provide additional entry and execution points for recovery and backup, see 
Listing 2 for details. Default settings are provided on the user level, i.e., DUNE- 
PDELAB. 
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1 init variables (); 
done = false; 
while (!done) try { 
MPIGuard guard (comm) ; 
if (this->processRecovery(...)) 
6 reinit execution () ; 
} else { 
init execution () ; 
} 
10 for (i=0 ; ic<=maxit; i++ ) { 
11 do_iteration () ; 
12 if (converged) { 
13 done = true; 
14 break; 
is } 


16 this->processBackup(...); 


} 


is } catch (Exception & e) { 

19 done = false; 

20 comm.reconstitute(); 

21 if (!this->sprocessOnException(...)) 
throw; 


Listing 2 Solver modifications 


This implementation ensures that the iterative solving process is active until the 
convergence criterion is reached. An exception inside the try-block on any rank is 
detected by the MPIGuard and propagated to all other ranks, so that all ranks will 
jump to the catch-block. 

This catch-block can be specialised for different kind of exceptions, e.g., if a 
solver has diverged and a corresponding exception is thrown it could define some 
specific routine to define a modified restart with a possibly more robust setting 
and/or initial guess. The catch-block in Listing 2 exemplarily shows a possible 
solution in the scenario of a communicator failure, e.g., a node loss which is 
detected by using the ULFM extension to MPI, encapsulated by our wrapper for 
MPI exceptions. Following the detection and propagation, all still valid ranks end 
up in the catch-block and the communicator must be re-established in some way 
(Listing 2, line 20). This can be done by shrinking the communicator or replacing 
lost nodes by some previously allocated spare ones. After the communicator 
reconstitution a user-provided stack of functions can be executed (Listing 2, line 21) 
to react on the exception. If there is no on-exception-function or neither of them 
returns true the exception is re-thrown to the next higher level, e.g., from the linear 
solver to the application level, or in case of nested solvers, e.g. in optimisation or 
uncertainty quantification. 

Furthermore, there are two additional entry points for user provided function 
stacks: In line 5 of Listing 2 a stack of recovery functions is executed and if it 
returns true, the solver expects that some modification, i.e., recovery, has been done. 
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In this case it could be necessary that the other participating ranks have to update 
some data, like resetting their local right hand side to the initial values. The backup 
function stack in line 16 allows the user to provide functions for backup creation 
etc., after an iteration finished successfully. 


Recovery Approaches First, regardless of these solver modifications, we describe 
the recovery concepts which are implemented into an exemplary recovery inter- 
face class providing functions that can be passed to the entry points within the 
modified solver. The interoperability of these components and the available backup 
techniques are described later. Our recovery class supports three different methods 
to recover from a data loss. The first approach is a global rollback to a backup, 
potentially involving lossy compression: progress on non-faulty ranks may be lost 
but the restored data originate from the same state, i.e., iteration. This means there 
is no asynchronous progression in the recovered iterative process but possibly 
just an error introduced through the used backup technique, e.g., through lossy 
compression. This compression error can reduce the quality of the recovery and 
lead to additional iterations of the solver, but is still superior to a restart, as seen 
later. For the second and third approaches, we follow the local-failure local-recovery 
strategy and re-initialize the data which are lost on the faulty rank by using a 
backup. The second, slightly simpler strategy uses these data to continue with solver 
iterations. The third method additionally smoothes out the probably deteriorated 
(because of compression) data by solving a local auxiliary problem [29, 31]. This 
problem is set up by restricting the global operator to its purely local degrees of 
freedom with indices F C N and a Dirichlet boundary layer. The boundary layer 
can be obtained by extending F to some set J using the ghost layer, or possibly 
the connectivity pattern of the operator A. The Dirichlet values on the boundary 
layer are set to their corresponding values xy on the neighbouring ranks and thus 
additional communication is necessary: 


AF, F)x(F) = bF) inf 
X= xy on NF 


If this problem is solved iteratively and backup data are available, the computa- 
tion speed can be improved by initializing x with the data from the backup. 


Backup Techniques Our current implementation provides two different techniques 
for compressed backups as well as a basic class which allows ‘zero’-recovery 
(zeroeing of lost data) if the user wants to use the auxiliary solver in case of data 
loss without storing any additional data during the iterative procedure. 

The next backup class uses a multigrid hierarchy for lossy data compression. 
Thus it should only be used if a multigrid operator is already in use within 
the solving process because otherwise the hierarchy has to be built beforehand 
and introduces additional overhead. Compressing the iterative vector with the 
multigrid hierarchy currently involves a global communication. In addition there 
is no adaptive control of the compression depth (i.e., hierarchy level where the 
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backup is stored), but it has to be specified by the user, see a previous publication 
for details [29]. 

We also implemented a compressed backup technique based on SZ compres- 
sion [41]. SZ allows compression to a specified accuracy target and can yield better 
compression rates than multigrid compression. The compression itself is purely 
local and does not involve any additional communication. We provide an SZ backup 
with a fixed user-specified compression target as well as a fully adaptive one which 
couples the compression target to the residual norm within the iterative solver. For 
the first we achieve an increased rate while we approach the approximate solution, 
as seen in Fig. 2 (top, pink lines), at the price of an increased overhead in case of a 
data loss (cf. Fig. 3). The backup with adaptive compression target (blue lines) gives 
more constant compression rates, and a better recovery in case of faults in particular 
in the second half of the iterative procedure of the solver. 

The increased compression rate for the fixed SZ backup is obtained because, dur- 
ing the iterative process, the solution gets more smooth and thus can be compressed 
better by the algorithm. For the adaptive method this gain is counteracted by the 
demand of a higher compression accuracy. 

All backup techniques require to communicate a data volume smaller than the 
volume of four full checkpoints, see Fig. 2 (bottom). Furthermore this bandwidth 
requirement is distributed over all 68 iterations (in the fault-free scenario) and could 
be decreased further by a lower checkpoint frequency. 

The chosen backup technique is initiated before the recovery class and passed to 
it. Further backup techniques can be implemented by using the provided base class 
and overloading the virtual functions. 


Bringing the Approaches Together The recovery class provides three functions 
which are added to the function stacks within the modified solver interface. The 
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Fig. 2 Compression rate in the iterative solution for an anisotropic Poisson problem on 52 cores 
with approximately 480 K DOF per core 
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backup routine is added to the stack of backup functions of the specified iterative 
solver and generates backups of the current iterative solution by using the provided 
backup class. 


SomeSolver solver; 

SomeBackup backup; 

Recovery recovery (backup) ; 

solver .addBackupFunction (&Recovery::backup, &recovery) ; 


To adapt numerical as well as communication overhead for different fault 
scenarios and machine characteristics, the backup creation frequency can be varied. 
After the creation of the backup it is sent to a remote rank where it is kept in memory 
but never written to disk. In the following this is called ‘remote backup’. Currently 
the backup propagation happens circular by rank. It is also possible to trigger writing 
a backup to disk. 

In the near future we will implement an on-the-fly recovery if an exception 
is thrown. These will be provided to the other two function stacks and will 
differ depending on the availability of the ULFM extensions: if the extension 
is not available we can only detect and propagate exceptions but not recover a 
communicator in case of hard faults, i.e., node losses (cf. Sect. 2.2). In this scenario 
the function provided to the on-exception stack will only write out the global state. 
Fault-free nodes will write the data of the current iterative vector, whereas for 
faulty nodes the corresponding remote backup is written. In the following the user 
will be able to provide a flag to the executable which modifies the backup object 
initiation to read in the stored checkpoint data. Afterwards the recovery function of 
our interface will overwrite the initial values of the solver with the checkpointed and 
possibly smoothed data like described above. If the ULFM extensions are available, 
the recovery can be realised without any user interaction: during the backup class 
initiation a global communication ensures that it is the first and therefore fault-free 
start of the parallel execution. If the process is a respawned one which replaces a 
lost rank, this communication is matched by a send communication created from 
the rank which holds the corresponding remote backup. This communication will 
be initiated by the on-exception function. In addition to this message the remote 
backup rank sends the stored compressed backup so that the respawned rank can 
use this backup to recover the lost data. 

So far, we have not fully implemented rebuilding the solver and preconditioner 
hierarchy, and the re-assembly of the local systems, in case of a node loss. 
This can be done with, e.g., message logging [13], or similar techniques which 
allow recomputing the individual data on the respawned rank without additional 
communication. 

Figure 3 shows the effect of various combinations of different backup and 
recovery techniques in case of a data loss on one rank after iteration 60. The problem 
is an anisotropic Poisson problem with zero Dirichlet boundary conditions which 
reaches the convergence criterion after 68 iterations in a fault-free scenario (black 
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Fig. 3 Convergence history in case of data loss and recovery on one rank, same setting as in Fig. 2. 
Bottom left: number of iterations to solve the auxiliary problem when using the backups as initial 
guess. Note that the groups of the same colour are important, not the individual graphs 


line). It is executed in parallel on 52 ranks with approximately 480,000 degrees of 
freedom per rank. Thus one rank loss corresponds to a loss of around 2% of data. 
For solving a conjugate gradient solver with an algebraic multigrid preconditioner 
is applied. In addition to the residual norm we show the number of iterations which 
are needed to solve the auxiliary problem when using different backups as initial 
guess at the bottom left. 

The different backup techniques are colour-coded (multigrid: red; adaptive SZ 
compression: blue; fixed SZ compression: pink; no backup: green). For the SZ 
techniques we consider two cases, each with a different compression accuracy 
(fixed compression), respectively a different additional scaling coefficient (SZ). 
Recovery techniques are coded with different line styles: global roll-back recovery 
is indicated by straight lines; simple local recovery is shown with dotted lines 
and if an auxiliary problem is solved to improve the quality of the recovery it 
is drawn with a dashed line style. We observe that a zero recovery, multigrid 
compression and a fixed SZ backup with a low accuracy target are not competitive if 
no auxiliary problem is solved. The number of iterations needed until convergence 
then increases significantly. By applying an auxiliary solver the convergence can be 
almost fully restored (one additional global iteration) but the auxiliary solver needs 
a high amount of iterations (multigrid: 28; sz: 70; no backup: 132). Other backup 
techniques only need 8 auxiliary solver iterations. When using adaptive or very 
accurate fixed SZ compression the convergence behaviour can be nearly preserved 
even when only a local recovery or a global roll-back is applied. The adaptive 
compression technique has similar data overhead as the fixed SZ compression 
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(cf. Fig. 2, bottom) but gives slightly better results: both adaptive SZ compression 
approaches introduce only one additional iteration for all recovery approaches. 
For the accurate fixed SZ compression (SZfixed_*_le-7) we have two additional 
iterations when using local or global recovery but if we apply the auxiliary solver 
we also have only one additional iteration until convergence. 


2.4 Communication Aware Krylov Solvers 


In Krylov methods multiple scalar products per iteration must be computed. This 
involves global sums in a parallel setting. As a first improvement we merged the 
evaluation of the convergence criterion to the computation of a scalar product. 
Obviously this does not effect the computed values, but the iteration terminates one 
iteration later. However this reduces the number of global reductions per iteration 
from 3 to 2 and thus already saves communication overhead. 

As a second step we modify the algorithm, such that only one global com- 
munication is performed per iteration. This algorithm can also be found in the 
paper of Chronopoulos and Gear [15]. Another optimization is to overlap the two 
scalar products with the application of the operator and preconditioner, respectively. 
This algorithm was first proposed by Gropp [27]. A fully elaborate version was 
then presented by Ghysels and Vanroose [27]. This version only needs one global 
reduction per iteration, which is overlapped with both the application of the 
preconditioner and operator. This algorithm is shown in Algorithm 2. 


Algorithm 1 PCG 
ro = b — Axo 
pı = Mro 


pı = (P1, ro) 

fori = 1,... do 
qi = Api 
a; = (Di, qi) 
Xi = Xi-1 + f pi 
ri = ři-1 7 f qi 
zi+1 = Mri 
break if ||r; || < € 
Pit = (Zi+1, Ti) 


_ Piy 
Pitt = Epi + Ziti 
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Algorithm 2 Pipelined CG 


ro = b — Axo 
pı = Mro 
qı = Apı 
pı = (p1, ro) 
a1 = (pi, q1) 
sı = Mqı 
ti = As, 


fori = 1,... do 


ri = Ti- — gi 
break if ||r;|| < € 

Zi+1 = Zi — fsi 
Wi+1 = Wi — Ati 

pi+1 = (Zi+1, Ti) 
iti = (Zi+1, Wi+1) 
i+ = Ta + Qi41 
vi+ı = Mwi+ı 

Ui4ı = AVi 

Si+1 = Pits; + Vi4 
G41 = Pilt + uiy 
Pi+1 = Pal p; + Zi41 
qi+1 = Pqi + Wi+1 
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With the new communication interface, described above, we are able to compute 
multiple sums in one reduction pattern and overlap the communication with 
computation. To apply these improvements in Krylov solvers the algorithm must 
be adapted, such that the communication is independent of the overlapping com- 
putation. For this adaption we extend the ScalarProduct interface by a function 
which can be passed multiple pairs of vectors for which the scalar product should 
be computed. The function returns a Future which contains a std: :vector< 


field_types, once it has finished. 


Future<vector<field type>> 
dots (initializer list<tuple<X&, X&>> pairs); 


The function can be used in the Krylov methods like this: 


scalarproduct_ future = sp.dot_norm({{p,q}, {z, b}, 
// compute while communicate 

auto result = scalarproduct_future.get ()j; 

field type p_dot_q = result[0]; 

iealeuliel Ieyyjors) v4 eles te) = setevewillie (fal || p 

fielditypernorni bi- ley sare res wiid 2DE 


{b,b}}); 


The runtime improvement of the algorithm strongly depends on the problem size 
and on the hardware. On large systems the communication overhead makes up a 
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Table 1 Memory requirement, computational effort and global reductions per iteration for 
different versions of the preconditioned conjugate gradients method 


PCG | 4N |- |2 
Chronopoulos and Gear | 6N [1N E 
Gropp | 6N 2N |2 overlapped 
 Ghysels and Vanroose | 10N [5N -|L overlapped 


Speedup compared to CGSolver 
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Fig. 4 Strong scaling for (pipelined) Krylov subspace methods 


large part of the runtime. However, the maximum speedup is 3 for reducing the 
number of global reductions and 2 for overlapping communication and computation, 
compared to the standard version, so that a maximum speedup of 6 is possible. 
The optimization also increases the memory requirements and vector operations per 
iteration. An overview of runtime and memory requirements of the methods can be 
found in Table 1. 

Figure 4 shows strong scaling for different methods. The shown speedup is 
per iteration and with respect to the Dune: :CGSolver, which is the current CG 
implementation in DUNE. We use an SSOR preconditoner in an additive overlapping 
Schwarz setup. The problem matrix is generated from a 5-star Finite Difference 
model problem. With less cores the current implementation is faster than our 
optimized one. But with higher core count our optimized version outperforms it. 
The test was executed on the helics3 cluster of the University on Heidelberg, 
with 5600 cores on 350 nodes. We expect that on larger systems the speedup 
will further increase, since the communication is more expensive. The overlap of 
communication and computation does not really come into play, since the currently 
used MPI version does not support it completely. 
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3 Hardware-Aware, Robust and Scalable Linear Solvers 


In this section we highlight improved concepts for high-performance iterative 
solvers. We provide matrix-based robust solvers on GPUs using sparse approximate 
inverses and optimize algorithm parameters using machine learning. On CPUs 
we significantly improve the node-level performance by using optimal matrix- 
free operators for Discontinous Galerkin methods, specialized partially matrix-free 
preconditioners as well as vectorized linear solvers. 


3.1 Strong Smoothers on the GPU: Fast Approximate Inverses 
with Conventional and Machine Learning Approaches 


In continuation of the first project phase, we enhanced the assembly of sparse 
approximate inverses (SPAI), a kind of preconditioner that we had shown to be 
very effective within the DUNE solver before [9, 26]. Concerning the assembly 
of such matrices we have investigated three strategies regarding their numerical 
efficacy (that is their quality in approximating A~'), the computational complexity 
of the actual assembly and ultimately, the total efficiency of the amortised assembly 
combined with all applications during a system solution. For both strategies, this 
includes a decisive performance engineering for different hardware architectures 
with focus on the exploitation of GPUs. 


SPAI-1 As a starting point we have developed, implemented and tuned a fast 
SPAI-1 assembly routine based on MKL/LAPACK routines (CPU) and on the 
cuBlas/cuSparse libraries, performing up to four times faster on the GPU. This 
implementation is based on the batched solution of QR decompositions that arise in 
Householder transformations during the SPAI minimisation process. In many cases, 
we observe that the resulting preconditioner features a high quality comparable to 
Gauss-Seidel methods. Most importantly, this result still holds true when taking 
into account the total time-to-solution, which includes the assembly time of the 
SPAI, even on a single core where the advantages of SPAI preconditioning over 
forward/backward substitution during the iterative solution process are not yet 
exploited. More systematic experiments with respect to these statements as well 
as their extension to larger test architectures are currently being conducted. 


SAINV This preconditioner creates an approximation of the factorised inverse 
A`! = ZDR ofa matrix A € RY*™ with D being a diagonal, Z an upper triangular 
and R a lower triangular Matrix. 

To describe our new GPU implementation, we write the row-wise updates in 
the right-looking, outer product form of the A-biconjugation-process of the SAINV 
factorisation as follows: The assembly of the preconditioner is based on a loop over 
the existing rows i € {1,..., N} of Z (initialised as unit matrix Zy), where in every 
iteration the loop generally calls three operations, namely a sparse-matrix vector 
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Algorithm 3 Algorithm of the row-wise updates 
for (j =i+1,...,N)do 


if Djj # 0 then > check if the fraction is unequal to zero 
Ce = 2il =D 
for n = Ia ..,nnz(a@) do 
if a, > £ x maxi, j (Aij) then > here a» is the n-th entry of the vector œ 
if check(z? zit) then > Has zj already an entry at the columnindex of the n-th entry of œ ? 
add(z’; sn) 
update_minimum(z ; ) > get new min. value of j-th row 
else if nnz(z j) <w x a then > maximum number of rowentries already reached? 
insert(z jn) > insert the value wp at the fitting position 
update_minimum(z ; ) 
else if a, > min(z;) then > check if the value of øn is bigger than the minimum of z j 
replace(min(z;),Qn) > replace the old minimum with the value of ap 


update_minimum(< ; ) 


multiplication, a dot product and an update of the remaining rows i + 1,...,N 
based on a drop-parameter €. 

In our implementation we use the ELLPACK and CSR formats, pre-allocating 
a fixed amount of nonzeros of the matrix Z using w times the average number of 
nonzeros per row of A. Having a fixed row size, no reallocation of the arrays of the 
matrix format is needed and the row-wise update can be computed in parallel. This 
idea is based on the observation that while the density w for typical drop tolerances 
is not strictly limited, it generally falls into the interval ]0, 3[. As the SpMV and 
the dot kernels are well established, we take a closer look at the row-wise update, 
which is described more detailed in Algorithm 3. We first compute the values to be 
added and store them in a variable œ. Then we iterate over all nonzero entries of a 
(which of course has the same sparsity pattern as z;) and check if the computed value 
exceeds a certain drop-tolerance. If this condition is met, we have three conditions 
for an insertion into the matrix Z: 


1. Check if there is already an existing nonzero value in the j-th row at the column 
index of the value a, and search for the new minimal entry of this row. 

2. Else check if there is still place in the j-th row, so we can simply insert the value 
Q@, into that row and search for the new minimal entry of this row. 

3. Else check if the value a, is greater than the current minimum. If this condition 
is satisfied, then switch the old minimal value with a, and search for the new 
minimal entry of this row. 


If none of these conditions is met, we drop the computed value without updating 
the current column and repeat these steps for the next values unequal to zero of the 
current row. This cap of values for each row also has the following disadvantages: by 
having a too small maximum of nonzeros per row, a qualitative A-orthogonalization 
cannot be performed. To avoid this case we only take values of w greater than one, 
which seems to be sufficient. Also, if a row has already reached the maximum 
number of nonzeros, additional but relatively small values may be dropped. This 
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Fig. 5 GPU smoother comparison, isotropic and anisotropic Poisson benchmarks 


can become an issue if the sum of these small numbers leads to a relevant entry 
in a later iteration. For a comparison, Fig.5 depicts the time-to-solution for V- 
cycle multigrid using different strong smoothers on a P100 GPU. All smoothers are 
constructed using 8 Richardson iterations with (reasonably damped if necessary) 
preconditioners such as Jacobi, Gauss-Seidel, ILU-0, SPAI-1, SPAI-e and SAINV. 
We set up the benchmark case from a 2D Poisson problem in the isotropic case and 
with two-sided anisotropies in the grid to harden the problem even for well-ordered 
ILU approaches. The SPAI approaches are the best choice for the smoother on the 
GPU. 


Machine Learning Finally we started investigating how to construct approximate 
inverses using methods from Machine Learning [53]. The basic idea here is to 
treat AT! as a discrete function in the course of a function regression process. The 
neural network therefore learns how to deduct (extrapolate) an approximation of 
the inverse. Once trained with many data pairs of matrices and their inverse (a 
sparse representation of it) a neural network like a multilayer perceptron can be 
able to approximate inverses rapidly. As a starting point we have employed the 
finite element method for the Poisson equation on different domains with linear 
basis functions and have used it to generate expedient systems of equations to solve. 
Problems of this kind are usually based on sparse M-matrices with characteristics 
that can be used to reduce the calculation time and effort of the neural network 
training and evaluation. Our results show that given the pre-defined quality of the 
preconditioner (equivalent to the € in a SPAI-e method), we can by far numerically 
outperform even Gauss-Seidel. Using Tensorflow [1] and numpy [4], the learning 
algorithm can even be performed on the GPU. Here we have used a three-layered 
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fully-connected perceptron with fifty neurons in each layer plus input and output 
layers, and employed the resulting preconditioners in a Richardson method to solve 
the mentioned problem on a three times refined L-domain with a fixed number of 
degrees of freedom. The numerical effort of each evaluation of the neural network 
is basically the effort of a matrix-vector-multiplication for each layer in which the 
matrix size depends on the number of neurons per layer (M) and the non zero 
entries (N) of the input matrix, like O(N M) for the first layer. The inner layers’ 
effort, without input and output layer, just depends on the number of neurons. The 
crucial task now is to balance the quality of the resulting approximation and the 
effort to evaluate the network. We use fully connected feed-forward multilayer 
perceptrons as a starting point. Fully connected means that every neuron in the 
network is connected to each neuron of the next layer. Moreover there are no 
backward connections between the different layers (feed-forward). The evaluation 
of such neural networks is a sequence of chained matrix-vector products. 

The entries of the system matrix are represented vector-wise in the input layer 
(cf. Fig. 6). In the same way, our output layer contains the entries of the approximate 
inverse. Between these layers we can add a number of hidden layers consisting of 
hidden neurons. How many hidden neurons we need to create strong approximate 
inverses is a key design decision and we discuss this below. In general our 
supervised training algorithm is a backward propagation with random initialisation. 
Alongside a linear propagation function itota = W - Ototal + b with the total (layer) 
net input itotal, the weight matrix W, the vector for the bias weights b and the total 
output of the previous layer Ototal, we use the rectified linear unit (ReLu) function as 
activation function a(x) and thus we can calculate the output y of each neuron as 
y= aj oj - wij). Here oj is the output of the preceding sending units and wij are 
the corresponding weights between the neurons. 

For the optimization we use the L2 error function and update the weights with 


(+) _ |) 


wij wi + y -oi - 6j, with the output o; of the sending unit and learning rate 


y. 6; symbolises the gradient decent method: 


5 = f' Gj) - (6 — oj) if neuron j is an output neuron 
F'G) - Voges (Ok + wej) if neuron j is a hidden neuron. 


system matrix input hidden output approximate 
layer layer layer inverse 
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Fig. 6 Model of a neural network for matrix inversion, cf. [53] 
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Fig. 7 Results for the defect correction with the neural network, cf. [53] 


For details concerning the test/training algorithm we refer to a previous publica- 
tion [53]. For the defect correction prototype, we find a significant speedup for a 
moderately anisotropic Poisson problem, see Fig. 7. 


3.2 Autotuning with Artificial Neural Networks 


Inspired by our usage of Approximate Inverses generated by artificial neural 
networks (ANNs), we exploit (Feed Forward-) neural networks (FNN) for the 
automatic tuning of solver parameters. We were able to show that it is possible to 
use such an approach to provide much better a-priori choices for the parametrisation 
of iterative linear solvers. In detailed studies for 2D Poisson problems we conducted 
benchmarks for many test matrices and autotuning systems using FNNs as well as 
convolutionary neural networks (CNNs) to predict the w parameter in a SOR solver. 
In Fig. 8 we depict 100 randomly choosen samples of this study. It can be seen that 
even for good a-priori choices of œw the NN-driven system can compete whilst ‘bad’ 
choices (labeled constant) might lead to a stalling solver. 


3.3 Further Development of Sum-Factorized Matrix-Free DG 
Methods 


While we were able to achieve good node-level performance with our matrix-free 
DG methods in the first funding period, our initial implementations still did not 
utilize more than about 10% of the theoretical peak FLOP throughput. In the second 
funding period, we systematically improved on those results by focusing on several 
aspects of our implementation: 
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Fig. 8 Result for 100 samples of the FNN-based autotuning system for the w parameter in SOR 


Introduction of Block-Based DOF Processing Our implementation is based 
on DUNE-PDELAB, a very flexible discretization framework for both continuous 
and discontinuous discretizations of PDEs. In order to support a wide range of 
discretizations, PDELab has a powerful system for mapping DOFs to vector and 
matrix entries. Due to this flexibility, the mapping process is rather expensive. On 
the other hand, Discontinuous Galerkin values will always be blocked in a cell-wise 
manner. This can be exploited by only ever mapping the first degree of freedom 
associated with each cell and then assuming that all subsequent values for this cell 
are directly adjacent to the first entry. We have added a special ‘DG codepath’ to 
DUNE-PDELAB which implements this optimization. 


Avoiding Unnecessary Memory Transfers As all of the values for each cell are 
stored in consecutive locations in memory, we can further optimize the framework 
behavior by skipping the customary gather/scatter steps before and after the 
assembly of each cell and facet integral. This is implemented by replacing the 
data buffer normally passed to the integration kernels with a dummy buffer that 
stores a pointer to the first entry in the global vector/matrix and directly operates on 
the global values. This is completely transparent to the integration kernels, as they 
only ever access the global data through a well-defined interface on these buffer 
objects. Together with the previous optimization, these two changes have allowed 
us to reduce the overhead of the framework infrastructure on assembly times from 
more than 100% to less than 5%. 


Explicit Vectorization The DG implementation used in the first phase of the 
project was written as scalar code and relied on the compiler’s auto vectorization 
support to utilize the SIMD instruction set of the processor, which we tried to 
facilitate by providing compile time loop bounds and aligned data structures. In 
the second phase, we have switched to explicit vectorization with a focus on AVX2, 
which is a common foundation instruction set across all current x 86-based HPC 
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processors. We exploit the possibilities of our C++ code base and use a well- 
abstracted library which wraps the underlying compiler intrinsic calls [23]. In a 
separate project [34], we are extending this functionality to other SIMD instruction 
sets like AVX512. 


Loop Reordering and Fusion While vectorization is required to fully utilize 
modern CPU architectures, it is not sufficient. We also have to feed the execution 
units with a substantial number of mutually independent chains of computation 
(*40-50 on current CPUs). This amount of parallelism can only be extracted from 
typical DG integration kernels by fusing and reordering computational loops. In 
contrast to other implementations of matrix-free DG assembly [22, 43], we do not 
group computations across multiple cells or facets, but instead across quadrature 
points and multiple input/output variables. In 3D, this works very well for scalar 
PDEs that contain both the solution itself and its gradient, which adds up to four 
quantities that exactly fit into an AVX2 register. 


Results Table 2 compares the throughput and the hardware efficiency of our 
matrix-free code for two diffusion-reaction problems A (axis-parallel grid, constant 
coefficients per cell) and B (affine geometries, variable coefficients per cell) with 
a matrix-based implementation. Figure 9 compares throughput and floating point 
performance of our implementation for these problems as well as an additional 
problem C with multi-linear geometries, demonstrating that we are able to achieve 
more than 50% of theoretical peak FLOP rate on this machine as well as a good 
computational processing rate as measured in DOFs/s. 

While our work in this project was mostly focused on scalar diffusion-advection- 
reaction problems, we have also applied the techniques shown here to projection- 
based Navier-Stokes solvers [51]. One important lesson learned was the unsustain- 
able amount of work required to extend our approach to different problems and/or 
hardware architectures. This led us to develop a Python-based code generator in 
a new project [34], which provides powerful abstractions for the building blocks 
listed above. This toolbox can be extended and combined in new ways to achieve 
performance comparable to hand-optimized code. Especially for more complex 
problems involving systems of equations, there are a large number of possible ways 
to group variables and their derivatives into sum factorization kernels due to our 
approach of vectorizing over multiple quantities within a single cell. The resulting 
search space is too large for manual exploration, which the above project solved by 
the addition of benchmark-driven automatic comparison of those variants. Finally, 
initial results show good scalability of our code as shown by the strong scaling 
results in Fig. 10. Our implementation shows good scalability until we reach a local 
problem size of just 18 cells, where we still need to improve the asynchronicity of 
ghost data communication and assembly. 
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Fig. 9 Floating point performance in GFLOPs/s and throughput in MDOFs/s for full operator 
application, 2x Intel Xeon E5-2698v3 2.3 GHz for all model problems 
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Fig. 10 Runtimes for strong scalability on IWR compute cluster (416 nodes with 2x ES-2630 v3 
each, 64 GiB/node, QDR Infiniband) 


3.4 Hybrid Solvers for Discontinuous Galerkin Schemes 


In Sect.3.3 we concentrated on the performance of matrix-free operator appli- 
cation. This is sufficient for instationary problems with explicit time integration, 
but in case of stationary problems or implicit time integration, (linear) algebraic 
systems need to be solved. This requires operator application and robust, scalable 
preconditioners. 

For this we extended hybrid AMG-DG preconditioners [8] in a joint work with 
Eike Miiller from Bath University, UK, [10]. In a solver for matrices arising from 
higher order DG discretizations the basic idea is to perform all computations on 
the DG system in a matrix-free fashion and to explicitly assemble only a matrix 
in a low-order subspace which is significantly smaller. In the sense of subspace 
correction methods [58] we employ a splitting 


Vo= >) VE +Ve 
TETh 
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where vÈ is the finite element space of polynomial degree p on element T and the 
coarse space V, is either the lowest-order conforming finite element space v! on the 
mesh 7p, or the space of piecewise constants ve. Note that the symmetric weighted 
interior penalty DG method from [21] reduces to the cell-centered finite volume 
method with two-point flux approximation on ve. Note also, that the system on Ve 
can be assembled without assembling the large DG system. 

For solving the blocks related to Ve , two approaches have been implemented. 
In the first (named partially matrix-free), these diagonal blocks are factorized using 
LAPACK and each iteration uses a backsolve. In the second approach the diagonal 
blocks are solved iteratively to low accuracy using matrix-free sum factorization. 
Both variants can be used in additive and multiplicative fashion. Figure 11 shows 
that the partially matrix-free variant is optimal for polynomial degree p < 5, but 
starting from p = 6, the fully matrix-free version starts to outperform all other 
options. 

In order to demonstrate the robustness of our hybrid AMG-DG method we use 
the permeability field of the SPE10 benchmark problem [14] within a heterogeneous 
elliptic problem. This is considered to be a hard test problem in the porous media 
community. The DG method from [21] is employed. Figure 12 depicts results for 
different variants and polynomial degrees run in parallel on 20 cores. A moderate 
increase with the polynomial degree can be observed. With respect to time-to- 
solution (not reported) the additive (block Jacobi) partially matrix-free variant is 
to be preferred for polynomial degree larger than one. 
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Fig. 11 Total solution time for different implementations and a range of block-solver tolerances € 
for the Poisson problem (left) and the diffusion problem with spatially varying coefficients (right), 
cf. [10] 
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Fig. 12 Convergence history for SPE10 benchmark. The relative energy norm is shown for 
polynomial degrees | (red squares), 2 (blue upward triangles) and 3 (green downward triangles). 
Results for the block-SSOR smoother are marked by filled symbols and results for the block-Jacobi 
smoother by empty symbols. cf. [10] 


3.5 Horizontal Vectorization of Block Krylov Methods 


Methods like Multiscale FEM (see Sect. 4), optimization and inverse problems need 
to invert the same operator for many right-hand-side vectors. This leads to a block 
problem, by the following conceptual reformulation: 


foreach i € [0, N]: solve Ax; = bi > solve AX = B, 


with matrices X = (xo,...xn), B = (bọ, ...by). Such problems can be solved 
using Block Krylov solvers. The benefit is that the approximation space can grow 
faster, as the solver orthogonalizes the updates for all right-hand-sides. Even for a 
single right-hand-side Block Krylov based enriched Krylov methods can be used to 
accelerate the solution process. 

Preconditioners and the actual Krylov solver can be sped up using horizontal 
vectorization. Assuming k right-hand-sides we observe that the scalar product 
yields a k x k dense matrix and has O(k*) complexity. While the mentioned 
larger approximation space should improve the convergence rate, this is only 
true for weaker preconditioners, therefore we pursued a different strategy and 
approximate the scalar product matrix by a sparse matrix, so that we again retain 
O (k) complexity. In particular we consider the case of a diagonal or block-diagonal 
matrix. The diagonal matrix basically results in k independent solvers running in 
parallel, so that the performance gain is solely based on SIMD vectorization and the 
associated favorable memory layout. 
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Fig. 13 Horizontal vectorization of a linear solver for 256 right-hand-side vectors. Timings on a 
Haswell-EP (E5-2698v3, 16 cores, AVX2, 4 lanes). Comparison with 1-16 cores and no SIMD, 
AVX (4 lanes), AVX (4 x 4 lanes) 


For the implementation in DUNE-ISTL we use platform portable C++ abstrac- 
tions of SIMD intrinsics, building on the VC library[38] and some DUNE specific 
extensions. We use this to exchange the underlying data type of the right-hand-side 
and the solution vector, so that we no longer store scalars, but SIMD vectors. This 
is possible when using generic programming techniques, like C++ templates, and 
yields a row-wise storage of the dense matrices X and B. This row-wise storage is 
optimal and ensures a high arithmetic intensity. The implementations of the Krylov 
solvers have to be adapted to the SIMD data types, since some operations, like casts 
and branches, are not available generically for SIMD data types. As a side effect, all 
preconditioners, including the AMG, are now fully vectorized. 

Performance tests using 256 right-hand-side vectors for a 3D Poisson problem 
show nearly optimal speedup on a 64 core system (see Fig. 13). The tests are carried 
out on a Haswell-EP (E5-2698v3, 16 cores, AVX2, 4 lanes). We observe a speedup 
of 50, while the theoretical speedup is 64. 


4 Adaptive Multiscale Methods 


The main goal in the second funding phase was a distributed adaptive multilevel 
implementation of the localized reduced basis multi-scale method (LRBMS [49]). 
Like Multiscale FEM (MsFEM), LRBMS is designed to work on heterogenous 
multiscale or large scale problems. It performs particularly well for problems that 
exhibit scale separation with effects on both a fine and a coarse scale contributing 
to the global behavior. Unlike MsFEM, LRBMS is best applied in multi-query 
settings in which a parameterized PDE needs to be solved many times for different 
parameters. As an amalgam of domain decomposition and model order reduction 
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techniques, the computational domain is partitioned into a coarse grid with each 
macroscopic grid cell representing a subdomain for which, in an offline pre-compute 
stage, local reduced bases are constructed. Appropriate coupling is then applied 
to produce a global solution approximation from localized data. For increased 
approximation fidelity we can integrate localized global solution snapshots into the 
bases, or the local bases can adaptively be enriched in the online stage, controlled 
by a localized a-posteriori error estimator. 


4.1 Continuous Problem and Discretization 


We consider elliptic parametric multi-scale problems on a domain Q C R? where 
we look for p(«) € Q that satisfy 


b(p( n), q; #) = lq) for allg € Hi (9), (1) 


je are parameters with y € P C RP, p e N. We let € > 0 be the multi-scale 
parameter associated with the fine scale. For demonstration purposes we consider 
a particular linear elliptic problem setup in Q C R? (d = 2,3) that exhibits a 
multiplicative splitting in the quantities affected by the multi-scale parameter e. It 
is a model for the so called global pressure p(f) € Hy (Q2) in two phase flow in 
porous media, where the total scalar mobility à (x) is parameterized. ke denotes the 
heterogenous permeability tensor and f the external forces. Hence, we seek p that 
satisfies weakly in Hy (2), 


-V-(A(w)keVp(u)) = f ing. (2) 


With A(x; p) := A(w)K_ (x) this gives rise to the following definition of the forms 
in (1) 


b(p( a), 4; W) = | Awp: va. lq) = f fa 


For the discretization we first require a triangulation TH of Q for the macro 
level. We call the elements T € TH subdomains. We then require each subdomain 
be covered with a fine partition t,(T) in a way that TH and ty := ETeTyTh(T) are 
nested. We denote by Fy the faces of the coarse triangulation and by Fp the faces 
of the fine triangulation. 

Let V (th) C H? (tn) denote any approximate subset of the broken Sobolev space 
H? (th) := {q € LQ) | ql: € H(t) Vt € th}. We call pp(m) € V(t) an 
approximate solution of (1), if 


bn(Ph(e), v; p) = lnv; p) for all v € V (ta). (3) 
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Here, the DG bilinear form by and the right hand side l} are chosen according to the 
SWIPDG method [20], i.e. 


by(v, w; p) =F f Awye- vw + > bo, ws p) 
t 


tETh eEFf(tn) 


In(u; u) := > ft. 
t 


tETh 


where the DG coupling bilinear forms by, for a face e is given by 


Oe (M) 
leļf 


b (v, w; p) = 1 (A(w) Vu + ne)[w] + (AV w - ne)[v] + AA vlw]. 


The LRBMS method allows for a variety of discretizations, i.e. approximation 
spaces V (t). As a particular choice of an underlying high dimensional approxima- 
tion space we choose V (th) = oF = Orery OF, where the discontinuous local 
spaces are defined as 


OFT := OFT (tA(T)) := {q € LCT) | qli € Pe) Yt € ta(T)}- 


4.2 Model Reduction 


For model order reduction in the LRBMS method we choose the reduced space 
Orea = PreTy Qla Cc oF with local reduced approximation spaces Qla C 


oF, We denote prea (4) to be the reduced solution of (3) in Qrea. This formulation 
naturally leads to solving a sparse blocked linear system similar to a DG approxi- 
mation with high polynomial degree on the coarse subdomain grid. 

The construction of subdomain reduced spaces Qla is again very flexible. 
Initialization with shape functions on T up to order k ensures a minimum fidelity. 
Basis extensions can be driven by a discrete weak greedy approach which incor- 
porates localized solutions of the global system. Depending on available com- 
putational resources, and given a suitable localizable a-posteriori error estimator 
n(pPrea (M), M), we can forego computing global high-dimensional solutions alto- 
gether and only rely on online enrichment to extend Qla ‘on the fly’. With online 
enrichment, given a reduced solution Prea(f) for some arbitrary u € P, we first 
compute local error indicators n7 (Prea(M), a) for all T € Tp. If n7 (Prea (M), BL) 
is greater than some prescribed bound 449) > 0, we solve on a overlay region 
N(T) D T and extend Ol, with pyr)(#)|7-. Inspired by results in [17] we set 
the overlay region’s diameter diam(N(T )) of the order O(diam(T)|log(diam(T))|). 
In practice we use the completely on-line/off-line decomposable error estimator 
developed in [49, Sec. 4] which in turn is based on the idea of producing a 
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conforming reconstruction of the diffusive flux A(w)Ke Vn pn( m) in some Raviart- 
Thomas-Nédélec space Vj (tn) C Agiy(&) presented in [21, 56]. 

This process is then repeated until either a maximum number of enrichment steps 
occur or n” (Pred (Mt), M) < Sto1- 


Algorithm 4 Schematic representation of the LRBMS pipeline 
Require: Pirain C P 
Require: Reconstruction operator Rj (prea (M)) : Qrea(T H) > Ok (th) 


1: function GREEDYBASISGENERATION(ogrdy, (Pred (M), #)=None ) 

2: if n(Prea (M), m) is not None then 

3: E < {n(Prea(ui), Hi) | Mi E€ Pirain} 

4: else 

5: E <+ {IR} (Prea (Hi) = Pr(eeil | Hi € Prrain} 

6: while E # Ø AND max(E) = Sgray do 

T i < argmax(E) 

8: compute pp (Mi) 

9: for all T € Ty do 
10: extend orm with pp (e)|7 
11: E<E\E; 
12: Generate TH > Offline Phase 
13: for all T € Ty do 
14: create Tt, (7) 
15: init Oli with DG shape functions of order k 
16: GREEDYBASISGENERATION(: --) > Optional 
17: compute prea (4) for arbitrary u > Online phase 
18: for all T € Ty do > Optional Adaptive Enrichment 


19: 1 < 7" (Prea (ht), m) 
20: while 7 > 5,,) do 


21: compute pyr) (KL) 
22: Of <— PNT) Wr 


4.3 Implementation 


We base our MPI-parallel implementation of LRBMS on the serial version devel- 
oped previously. In this setup the high-dimensional quantities and all grid structures 
are implemented in DUNE. The model order reduction as such is implemented in 
Python using pyMOR [45]. The model reduction algorithms in pyMOR follow a 
solver agnostic design principle. Abstract interfaces allow for example projections, 
greedy algorithms or reduced data reconstruction to be written without knowing 
details of the PDE solver backend. The global macro grid Ty can be any 
MPI-enabled DUNE grid manager with adjustable overlap size for the domain 
decomposition, we currently use DUNE-YASPGRID. The fine grids t,(T) are 
constructed using the same grid manager as on the macro scale, with MPI subcom- 
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municators.These are currently limited to a size of one (rank-local), however the 
overall scalability could benefit from dynamically sizing these subcommunicators 
to balance communication overhead and computational intensity as demonstrated in 
[36, Sec. 2.2]. The assembly of the local (coupling) bilinear forms is done in DUNE- 
GDT [24], with pyMOR/Python bindings facilitated through DUNE-XT [46], where 
DUNE-GRID-GLUE [19] generates necessary grid views for the SWIPDG coupling 
between otherwise unrelated grids. Switching to DUNE-GRID-GLUE constitutes a 
major step forward in robustness of the overall algorithm, compared to our previous 
manually implemented approach to matching independent local grids for coupling 
matrices assembly. 

We have identified three major challenges in parallelizing all the steps in 
LRBMS: 


1. Global solutions p;, (2) of the blocked system in Eq. (3) with an appropriate 
MPI-parallel iterative solver. With the serial implementation already using 
DUNE-ISTL as the backend for matrix and vector data, we only had to generate 
an appropriate communication configuration for the blocked SWIPDG matrix 
structure to make the BiCGStab solver usable in our context. We tested this setup 
on the SuperMUC Petascale System in Garching. The results in Fig. 14 show very 
near ideal speedup from 64 nodes with 1792 MPI ranks up to a full island with 
512 nodes and 14336 ranks. 
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Fig. 14 Localized Reduced Basis Method: Block-SWIPDG speedup results; linear system solve 
(green), discretization and system assembly (blue), theoretic ideal speedup (violet) and actual 
achieved speedup for the overall run time (red). Simulation on ~7.9 - 10° cubical cells shows 
minimum 94% parallel efficiency, scaling from 64 to 512 nodes (SuperMUC Phase 2) 
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2. (Global) Reduced systems also need a distributed solver. By design all 
reduced quantities in pyMOR are, at the basic, unabstracted level, NumPy 
arrays [57]. Therefore we cannot easily re-use the DUNE-ISTL based solvers for 
the high-dimensional systems. Our current implementation gathers these reduced 
system matrices from all MPI-ranks to rank 0, recombines them, solves the 
system with a direct solver and scatters the solution. There is great potential in 
making this step more scalable by either using a distributed sparse direct solver 
like Mumps [3] or translating the data into the DUNE-ISTL backend. 

3. Adaptive online enrichment is inherently load imbalanced due to its local- 
ized error estimator guidance. The load imbalance results from one rank idling 
while waiting to receive updates to a basis on a subdomain in its overlap region 
from another rank. This idle time can be minimized by encapsulating the update 
in a MPIFuture described in Sect. 2.1. This will allow the rank to continue in its 
own enrichment process until the updated basis is actually needed in a subsequent 
step. 


5 Uncertainty Quantification 


The solution of stochastic partial differential equations (SPDEs) is characterized by 
extremely high dimensions and poses great (computational) challenges. Multilevel 
Monte Carlo (MLMC) algorithms attract great interest due to their superiority over 
the standard Monte Carlo approach. Based on Monte Carlo (MC), MLMC retains 
the properties of independent sampling. To overcome the slow convergence of MC, 
where many computationally expensive PDEs have to be solved, MLMC combines 
in a proper way cheap MC estimators and expensive MC estimators, achieving 
(much) faster convergence. One of the critical components of the MLMC algorithms 
is the way in which the coarser levels are selected. The exact definition of the 
levels is an open question and different approaches exist. In the first funding phase, 
Multiscale FEM was used as a coarser level in MLMC. During the second phase, 
the developed parallel MLMC algorithms for uncertainty quantification were further 
enhanced. The main focus was on exploring the capabilities of the renormalization 
approach for defining the coarser levels in the MLMC algorithm, and on using 
MLMC as a coarse grained parallelization approach. 

Here, we employ MLMC to exemplarily compute the mean flux through 
saturated porous media with prescribed pressure drop and known distribution of 
the random coefficients. 


Mathematical Problem As a model problem in R? or RÌ, we consider steady state 
single phase flow in random porous media: 


—V -[k(x, @)V p(x, w)] = 0 for x € D = (0, 1)? 8 EQ 
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subject to the boundary conditions py=o = 1l and py=; = O and zero flux on 
other boundaries. Here p is pressure, k is scalar permeability, and w is a random 
vector. The quantity of interest is the mean (expected value) E[Q] of the total flux 
Q through the inlet of the unit cube i.e., Q(x, œ) := fok, @)0n p(x, w)dx. 
Both the coefficient k(x, œ) and the solution p(x, œ) are subject to uncertainty, 
characterized by the random vector w in a properly defined random space Q. For 
generating permeability fields we consider the practical covariance C(x, y) = 
o7exp(—||x—y||2/A). An algorithm based on forward and inverse Fourier transform 
over the circulant covariance matrix is used to generate the permeability field. For 
solving the deterministic PDEs a Finite Volume method on a cell centered grid is 
used [32]. More details and further references can be found in a previous paper [47]. 


Monte Carlo Simulations To quantify the uncertainty, and compute the mean of 
the flux we use a MLMC algorithm. Let wy be a random vector over a properly 
defined probability space, and Qm be the corresponding flux. It is known that 
E[ Q m] can be made arbitrarily close to E[Q] by choosing M sufficiently large. The 
standard MC algorithm convergences very slowly, proportionally to the variance 
over the square root of the number of samples, which makes it often unfeasible. 
MLMC introduces L levels with the L-th level coinciding with the considered 
problem, and exploits the telescopic sum identity: 


ElO%,(o)] = E09, (@)] + ELC (@) — 08, (ol +... ELO) — OF 1(0)] 


The notation Y! = Q! — Q’! is also used. The main idea of MLMC is to 
properly define levels, and combine a large number of cheap simulations, that are 
able to approximate the variance well, with a small number of expensive correction 
simulations providing the needed accuracy. For details on Monte Carlo and MLMC 
we refer to previous publications [32, 47] and the references therein. Here, the target 
is to estimate the mean flux on a fine grid, and we define the levels as discretizations 
on coarser grids. In order to define the permeability at the coarser levels we use the 
renormalization approach. 

MLMC has previously run the computations at each level with the same 
tolerance. However, in order to evaluate the number of samples needed per level, 
one has to know the variance at each level. Because the exact variance is not known 
in advance, MLMC starts by performing simulations with a prescribed, moderate 
number of samples per level. The results are used to evaluate the variance at each 
level, and thus to evaluate the number of samples needed per level. This procedure 
can be repeated several times in an Estimate—Solve cycle. At each estimation step, 
information from all levels is needed, which leads to a synchronization point in the 
parallel realization of the algorithm. This may require dynamic redistribution of the 
resources after each new evaluation. 

MLMC can provide a coarse graining in the parallelization. A well balanced 
algorithm has to account for several factors: (1) How many processes should be 
allocated per level; (2) how many processes should be allocated per deterministic 
problem including permeability generation; (3) how to parallelize the permeability 
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generation; (4) which of the parallelization algorithms for deterministic problems 
available in EXA-DUNE should be used; (5) should each level be parallelized 
separately and if not, how to group the levels for efficient parallelization. The last 
factor is the one giving coarse grain parallelization opportunities. For the generation 
of the permeability, we use the parallel MPI implementation of the FFTW library. 
As deterministic solver, we use a parallel implementation of the conjugate gradient 
scheme preconditioned with AMG, provided by DUNE-ISTL. Both of them have 
their own internal domain decomposition. 

We shortly discuss one Estimate-Solve cycle of the MLMC algorithm. Without 
loss of generality we assume 3-level MLMC. Suppose that we have already 
computed the required number of samples per level (i.e., we are after Estimate 
and before Solve). Let us denote by N;,i = {0,1,2} the number of required 
realizations per level for Y;, by p; the number of processes allocated per Y;, by 
Pi, the respective group size of processes working on a single realization, by n 
the number of realizations for each group of levels, with t; the respective time for 
solving a single problem once, and finally with p'è! the total number of available 
processes. Then we can compute the total CPU time for the current Estimate-Solve 
cycle as 


total 


Topy = Noto + Niti + Noto. 


Ideally each process should take Tgp, = Té(#!/p'!. Dividing the CPU time 
needed for one f; by Tepi: we get a continuous value for the number of processes on 
a given level pil = Niti / Tépy for i = {0, 1, 2}. Then we can take p; = | p]: 
To obtain an integer value for the number of processes allocated per level, first we 
construct a set of all possible splits of the original problem as a combination of 
subproblems (e.g., parallelize level 2 separately and the combination of levels 0 and 
1, or parallelize all levels simultaneously, etc.). Each element of this set is evaluated 
independently, and all combinations of upper and lower bounds are calculated, 
such that p'4@! is divisible by Pi» Ai pi < p*! and p; < N; Pi- Traversing, 
computing and summing the computational time needed for each element gives us 
a time estimation. Then we select the element (grouping of levels) with minimal 
computational time. To tackle the distribution of the work on a single level, a similar 
approach can be employed. Due to the large dimension of the search tree a heuristic 
technique can be employed. Here we consider a simple predefined group size for 
each deterministic problem, having in mind that when using AMG the work for a 
single realization at the different levels is proportional to the unknowns at this level. 


Numerical Experiments Results for a typical example are shown in Fig. 15. The 
parameters are o = 2.75, = 0.3. The tests are done on SuperMUC-NG, LRZ 
Munich on a dual Intel Skylake Xeon Platinum 8174 node. Due to the stochasticity 
of the problem, we plot the speedup multiplied with the proportion of the tolerance. 
The renormalization has shown to be a very good approach for defining the coarser 
levels in MLMC. The proposed parallelization algorithm gives promising scalability 
results. It is weakly coupled to the number of samples that MLMC estimates. 
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Fig. 15 Scalability of the MLMC approach 


Although the search for an optimal solution is an NP-hard problem, the small 
number of levels enables a full traversing of the search tree. It can be further 
improved by automatically selecting the number of processes per group that solves 
a single problem. One also may consider introducing interrupts between the MPI 
communicators on a level to further improve the performance. 


6 Land-Surface Flow Application 


To test some of the approaches developed in the EXA-DUNE project, especially 
the usage of sum-factorized operator evaluation with more complex problems, 
we developed an application to simulate coupled surface-subsurface flow for 
larger geographical areas. This is a topic with high relevance for a number of 
environmental questions from soil protection and groundwater quality up to weather 
and climate prediction. 

One of the central aims of the new approach developed in the project is to 
be able to relate a physical meaning to the parameter functions used in each 
grid cell. This is not possible with the traditional structured grid approaches as 
the necessary resolution would be prohibitive. To avoid the excessive memory 
requirements of completely unstructured grids we build on previous results for 
block-structured meshes and use a two-dimensional unstructured grid on the surface 
which is extruded in a structured way in vertical direction. However, more flexible 
discretization schemes are needed for such grids, compared to the usual cell- 
centered Finite-Volume approaches. 
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6.1 Modelling and Numerical Approach 


To describe subsurface flow we use the Richards equation [52] 


HW Y. iky) (Vy +eg)]+4u =0 

where 0 is the volumetric water content, y the soil water potential, k the hydraulic 
conductivity, e, the unit vector pointing in the direction of gravity and qw a 
volumetric source or sink term. 

In nearly all existing models for coupled surface-subsurface flow, the kinematic- 
wave approximation is used for surface flow, which only considers surface slope 
as driving force and does not even provide a correct approximation of the steady- 
state solution. The physically more realistic shallow-water-equations are used rarely, 
as they are computationally expensive. We use the diffusive-wave approximation, 
which still retains the effects of water height on run-off, yields a realistic steady- 
state solution and is a realistic approximation for flow on vegetation covered ground 
[2]: 

oh 

g [D(h, Vh)V (h + 2)] = fe, (4) 
where h is the height of water over the surface level z, fẹ is a source-sink term 
(which is used for the coupling) and the diffusion coefficient D is given by 


hæ 
D(h, Vh) = =— Sa 
C- |V + z)! 
with || - || refering to the Euclidean norm and a, y and C are empirical constants. In 


the following we use a = 3 and y = 5 to obtain Manning’s formula and a friction 
coefficient of C = 1. 

Both equations are discretised with a cell-centered Finite-Volume scheme and 
alternatively with a SWIPDG scheme in space (see Sect.3) and an appropriate 
diagonally implicit Runge-Kutta scheme in time for the subsurface and an explicit 
Runge-Kutta scheme for the surface flow. Upwinding is used for the calculation of 
conductivity in subsurface flow [5] and for the water height in the diffusion term in 
surface flow. 

First tests have shown that the formulation of the diffusive-wave approximation 
from the literature as given by Eq. (4) does not result in a numerically stable solution 
if the gradient becomes very small, as then a gradient approaching zero is multiplied 
by a diffusion coefficient going to infinity. A much better behaviour is achieved by 
rewriting the equation as 


Tae ers 
at C Yk] 
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where the rescaled gradient mes is always going to zero when V (h + z) is 


going to zero as long as y < 1 and the new diffusion coefficient A“ /C is bounded. 

Due to the very different time-scales for surface and subsurface flow, an operator- 
splitting approach is used for the coupled system. A new coupling condition has 
been implemented, which is a kind of Dirichlet-Neumann coupling, but guarantees 
a mass-conservative solution. With a given height of water on the surface (from the 
initial condition or the last time step modified by precipitation and evaporation), 
subsurface flow is calculated with a kind of Signorini boundary condition, where 
all surface water is infiltrated in one time step as long as the necessary gradient 
is not larger than the pressure resulting from the water ponding on the surface (in 
infiltration) and potential evaporation rates are maintained as long as the pressure 
at the surface is not below a given minimal pressure (during evaporation). The 
advantage of the new approach is that it does not require a tracking of the sometimes 
complicated boundary between wet and dry surface elements, that it yields no 
unphysical results and that the solution is mass-conservative even if not iterated 
until convergence. 

Parallelisation is obtained by an overlapping or non-overlapping domain-decom- 
position (depending on the grid). However, only the two-dimensional surface grid 
is partitioned whereas the vertical direction is kept on one process due to the strong 
coupling. Thus there is also no need for communication of surface water height 
for the coupling, as the relevant data is always stored in the same process. The 
linear equation systems are solved with the BiCGstab-solver from DUNE-ISTL with 
Block-ILUO as preconditioner. The much larger mesh size in horizontal direction 
compared to the vertical direction results in strong coupling of the unknowns in the 
vertical direction. The Block-ILUO scheme provides an almost exact solver of the 
strongly coupled blocks in the vertical direction and is thus a very effective scheme. 
Furthermore, one generally has a limited number of cells in the vertical direction 
and extends the mesh in horizontal direction to simulate larger regions. Thus the 
good properties of the solver are retained when scaling up the size of the system. 


6.2 Performance Optimisations 


As the time steps in the explicit scheme for surface flow can get very small due to 
the stability limit, a significant speedup can be achieved by using a semi-implicit 
scheme, where the non-linear coefficients are calculated with the solution from the 
previous time step or iteration. However, if the surface is nearly completely dry, this 
could lead to numerical problems, thus an explicit scheme is still used under nearly 
dry conditions with an automatic switching between both. 

While matrix-free DG solvers with sum-factorization can yield excellent per 
node performance (Sect. 3.3), it is a rather tedious task to implement them for new 
partial differential equations. Therefore, a code-generation framework is currently 
being developed in a project related to EXA-DUNE [33]. The framework is used to 
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implement an alternative optimized version of the solver for the Richards equation 
as this is the computationally most expensive part of the computations. Expensive 
material functions like the van Genuchten model including several power functions 
are replaced by cubic spline approximations, which allow a fast vectorized incor- 
poration of flexible material functions to simulate strongly heterogeneous systems. 
Sum-factorisation is used in the operator evaluations for the DG-discretization with 
a selectable polynomial degree. 

A special pillar grid has been developed as a first realisation of a 2.5D grid 
[33]. It adds a vertical dimension to a two-dimensional grid (which is either 
structured or unstructured). However, as the current version still produces a full 
three-dimensional mesh at the moment, future developments are necessary to exploit 
the full possibilities of the approach. 


6.3 Scalability and Performance Tests 


Extensive tests covering infiltration as well as exfiltration have been performed (e.g. 
Fig. 16) to test the implementation and the new coupling condition. Good scalability 
is achieved in strong as well as in weak scaling experiments on up to 256 nodes 
and 4096 cores of the bwForCluster in Heidelberg (Fig. 17). Simulations for a large 
region with topographical data taken from a digital elevation model (Fig. 18) have 
been conducted as well. 

With the generated code-based solver for the Richards equation a substantial 
fraction of the system’s peak performance (up to 60% on a Haswell-CPU) can be 


Time = 0 sec Time = 100 sec Time = 1000 sec Time = 2000 sec 


moega 


Top view Front view Side view Perspective 


Fig. 16 Surface runoff and infiltration of 5cm water into a dry coarse sand (top) and the 
unstructured 2.5D mesh used for the simulations (bottom) 
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Fig. 17 Results of a strong (left) and weak (right) scalability test with a coupled run-off and 
infiltration experiment on 1 to 256 nodes (16 to 4096 Intel Xeon E5-2630 v3 2.4 GHz CPU cores) 
of the bwForCluster at IWR in Heidelberg 


Surface element. 
Soil element 


Fig. 18 Pressure distribution with an overlay of the landscape taken from Google Earth calculated 
in a simulation of water transport in a real landscape south of Brunswick simulated on 30 nodes 
with 1200 cores of HLRN-IV in Göttingen (2x Intel Skylake Gold 6148 2.4 GHz CPUs) 


utilized due to the combination of sum factorization and vectorisation (Fig. 19). 
For the Richards equation (as for other PDEs before) the number of millions of 
degrees of freedom per second is independent of the polynomial degree with this 
approach. We measure a total speedup of 3 compared to the naive implementation 
in test simulations on a Intel Haswell Core 17-5600U 2.6 GHz CPU with first order 
DG base functions on a structured 32 x 32 x 32 mesh for subsurface and 32 x 32 
grid for surface flow. Even higher speedups are expected with higher-order base 
functions and matrix-free iterative solvers. The fully-coupled combination of the 
Richards solver obtained with the code generation framework and surface-runoff is 
tested with DG up to fourth order on structured as well as on unstructured grids. 
Parallel simulations are possible as well. 
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Fig. 19 Performance of the Richards solver implemented with the code generation framework for 
EXA-DUNE 


7 Conclusion 


In EXA-DUNE we extended the DUNE software framework in several directions 
to make it ready for the exascale architectures of the future which will exhibit a 
significant increase in node level performance through massive parallelism in form 
of cores and vector instructions. Software abstractions are now available that enable 
asynchronicity as well as parallel exception handling and several use cases for these 
abstractions have been demonstrated in this paper: resilience in multigrid solvers 
and several variants of asynchronous Krylov methods. Significant progress has been 
achieved in hardware-aware iterative linear solvers: we developed preconditioners 
for the GPU based on approximate sparse inverses, developed matrix-free operator 
application and preconditioners for higher-order DG methods and our solvers are 
now able to vectorize over multiple right hand sides. These building blocks have 
then been used to implement adaptive localized reduced basis methods, multilevel 
Monte-Carlo methods and a coupled surface-subsurface flow solver on up to 14k 
cores. The EXA-DUNE project has spawned a multitude of research projects, running 
and planned, as well as further collaborations in each of the participating groups. We 
conclude that the DUNE framework has made a major leap forward due to the EXA- 
DUNE project and work on the methods investigated here will continue in future 
research projects. 


266 


P. Bastian et al. 


References 


i; 


2: 


3: 


10. 


11. 


12. 


13. 


14. 


15. 


16. 


17. 


18. 


Abadi, M.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). http:// 
tensorflow.org/ 

Alonso, R., Santillana, M., Dawson, C.: On the diffusive wave approximation of the shallow 
water equations. Eur. J. Appl. Math. 19(5), 575—606 (2008) 

Amestoy, P.R., Duff, I.S., Koster, J., J.-Y., L.: A fully asynchronous multifrontal solver using 
distributed dynamic scheduling. SIAM J. Matrix Anal. Appl. 23(1), 15—41 (2001) 


. Ascher, D., Dubois, P., Hinsen, K., Hugunin, J., Oliphant, T.: Numerical Python. Lawrence 


Livermore National Laboratory, Livermore (1999) 


. Bastian, P.: A fully-coupled discontinuous Galerkin method for two-phase flow in porous 


media with discontinuous capillary pressure. Computat. Geosci. 18(5), 779-796 (2014) 


. Bastian, P., Blatt, M., Dedner, A., Engwer, C., Klofkorn, R., Kornhuber, R., Ohlberger, M., 


Sander, O.: A generic grid interface for parallel and adaptive scientific computing. Part II: 
Implementation and tests in DUNE. Computing 82(2-3), 121—138 (2008). https://doi.org/10. 
1007/s00607-008-0004-9 


. Bastian, P., Blatt, M., Dedner, A., Engwer, C., Kl6fkorn, R., Ohlberger, M., Sander, O.: A 


generic grid interface for parallel and adaptive scientific computing. Part I: abstract framework. 
Computing 82(2-3), 103—119 (2008). https://doi.org/10.1007/s00607-008-0003-x 


. Bastian, P., Blatt, M., Scheichl, R.: Algebraic multigrid for discontinuous Galerkin discretiza- 


tions of heterogeneous elliptic problems. Numer. Linear Algebra Appl. 19(2), 367-388 (2012). 
https://doi.org/10.1002/nla.1816 


. Bastian, P., Engwer, C., Fahlke, J., Geveler, M., Gdddeke, D., Iliev, O., Ippisch, O., Milk, 


R., Mohring, J., Miithing, S., Ohlberger, M., Ribbrock, D., Turek, S.: Advances concerning 
multiscale methods and uncertainty quantification in EXA-DUNE. In: Software for Exascale 
Computing—SPPEXA 2013-2015. Lecture Notes in Computational Science and Engineering. 
Springer Berlin (2016) 

Bastian, P., Müller, E., Miithing, S., Piatkowski, M.: Matrix-free multigrid block- 
preconditioners for higher order discontinuous Galerkin discretisations. J. Comput. Phys. 394, 
417—439 (2019). https://doi.org/10.1016/j.jcp.2019.06.001 

Bland, W., Bosilca, G., Bouteiller, A., Herault, T., Dongarra, J.: A proposal for user-level failure 
mitigation in the MPI-3 standard. Department of Electrical Engineering and Computer Science, 
University of Tennessee (2012) 

Buis, P., Dyksen, W.: Efficient vector and parallel manipulation of tensor products. ACM Trans. 
Math. Softw. 22(1), 18-23 (1996). https://doi.org/10.1145/225545.225548 

Cantwell, C., Nielsen, A.: A minimally intrusive low-memory approach to resilience for 
existing transient solvers. J. Sci. Comput. 78(1), 565-581 (2019). https://doi.org/10.1007/ 
s10915-018-0778-7 

Christie, M., Blunt, M.: Tenth SPE comparative solution project: a comparison of upscaling 
techniques. In: SPE Reservoir Simulation Symposium (2001). https://doi.org/10.2118/72469- 
PA 

Chronopoulos, A., Gear, C.W.: s-step iterative methods for symmetric linear systems. J. 
Comput. Appl. Math. 25(2), 153-168 (1989) 

Dongarra, J., Hittinger, J., Bell, J., Chacon, L., Falgout, R., Heroux, M., Hovland, P., Ng, E., 
Webster, C., Wild, S., Pao, K.: Applied mathematics research for exascale computing. Tech. 
rep., U. S. Department of Energy, Office of Science, Advanced Scientific Computing Research 
Program (2014). http://science.energy.gov/~/media/ascr/pdf/research/am/docs/EM W Greport. 
pdf 

Elfverson, D., Ginting, V., Henning, P.: On multiscale methods in Petrov—Galerkin formulation. 
Numer. Math. 131(4), 643-682 (2015). https://doi.org/10.1007/s002 1 1-015-0703-z 

Engwer, C., Dreier, N.A., Altenbernd, M., Göddeke, D.: A high-level C++ approach to 
manage local errors, asynchrony and faults in an MPI application. In: Proceedings of 26th 
Euromicro International Conference on Parallel, Distributed, and Network-Based Processing 
(PDP 2018) (2018) 


Exa-Dune—Flexible PDE Solvers, Numerical Methods and Applications 267 


19. 


20. 


21. 


22. 


23; 
24. 
25. 


26. 


27. 


28. 


29. 


30. 


31. 


32 


33; 


34. 


35. 


36. 


37. 


38. 


39. 


Engwer, C., Miithing, S.: Concepts for flexible parallel multi-domain simulations. In: Dickopf, 
T., Gander, M.J., Halpern, L., Krause, R., Pavarino, L.F. (eds.) Domain Decomposition 
Methods in Science and Engineering XXII, pp. 187-195. Springer, Berlin (2016) 

Ern, A., Stephansen, A., Vohralik, M.: Guaranteed and robust discontinuous Galerkin a 
posteriori error estimates for convection-diffusion-reaction problems. J. Comput. Appl. Math. 
234(1), 114-130 (2010). https://doi.org/10.1016/j.cam.2009.12.009 

Ern, A., Stephansen, A., Zunino, P.: A discontinuous Galerkin method with weighted averages 
for advection-diffusion equations with locally small and anisotropic diffusivity. IMA J. Numer. 
Anal. 29(2), 235-256 (2009). https://doi.org/10.1093/imanum/drm050 

Fehn, N., Wall, W., Kronbichler, M.: A matrix-free high-order discontinuous Galerkin com- 
pressible Navier-Stokes solver: a performance comparison of compressible and incompressible 
formulations for turbulent incompressible flows. Numer. Methods Fluids 89(3), 71-102 (2019) 
Fog, A.: VCL C++ vector class library (2017). http://www.agner.org/optimize/vectorclass.pdf 
Fritze, R., Leibner, T., Schindler, F.: dune-gdt (2016). https://doi.org/10.528 1/zenodo.593009 
Gagliardi, F., Moreto, M., Olivieri, M., Valero, M.: The international race towards exascale 
in Europe. CCF Trans. High Perform. Comput. 1(1), 3-13 (2019). https://doi.org/10.1007/ 
s42514-019-00002-y 

Geveler, M., Ribbrock, D., Géddeke, D., Zajac, P., Turek, S.: Towards a complete FEM-based 
simulation toolkit on GPUs: unstructured grid finite element geometric multigrid solvers with 
strong smoothers based on sparse approximate inverses. Comput. Fluids 80, 327-332 (2013). 
https://doi.org/10.1016/j.compfluid.2012.01.025 

Ghysels, P., Vanroose, W.: Hiding global synchronization latency in the preconditioned 
conjugate gradient algorithm. Parallel Comput. 40(7), 224-238 (2014) 

Gmeiner, B., Huber, M., John, L., Riide, U., Wohlmuth, B.: A quantitative performance study 
for Stokes solvers at the extreme scale. J. Comput. Sci. 17, 509-521 (2016). https://doi.org/10. 
1016/j.jocs.2016.06.006 

Géddeke, D., Altenbernd, M., Ribbrock, D.: Fault-tolerant finite-element multigrid algorithms 
with hierarchically compressed asynchronous checkpointing. Parallel Comput. 49, 117-135 
(2015). https://doi.org/10.1016/j.parco.2015.07.003 

Gupta, S., Patel, T., Tiwari, D., Engelmann, C.: Failures in large scale systems: long-term 
measurement, analysis, and implications. In: Proceedings of the 30th IEEE/ACM International 
Conference on High Performance Computing, Networking, Storage and Analysis (SC) (2017) 
Huber, M., Gmeiner, B., Rüde, U., Wohlmuth, B.: Resilience for massively parallel multigrid 
solvers. SIAM J. Sci. Comput. 38(5), S217-S239 

Iliev, O., Mohring, J., Shegunov, N.: Renormalization based MLMC method for scalar elliptic 
SPDE. In: International Conference on Large-Scale Scientific Computing, pp. 295-303 (2017) 
Kempf, D.: Code generation for high performance PDE solvers on modern architectures. Ph.D. 
thesis, Heidelberg University (2019) 

Kempf, D., Heß, R., Miithing, S., Bastian, P.: Automatic code generation for high-performance 
discontinuous Galerkin methods on modern architectures (2018). arXiv: 1812.08075 

Keyes, D.: Exaflop/s: the why and the how. Comptes Rendus Mécanique 339(2-3), 70-77 
(2011). https://doi.org/10.1016/j.crme.2010.11.002 

Klawonn, A., Köhler, S., Lanser, M., Rheinbach, O.: Computational homogenization with 
million-way parallelism using domain decomposition methods. Comput. Mech. 65(1), 1-22 
(2020). https://doi.org/10.1007/s00466-019-01749-5 

Kohl, N., Thénnes, D., Drzisga, D., Bartuschat, D., Rüde, U.: The HyTeG finite-element 
software framework for scalable multigrid solvers. Int. J. Parallel Emergent Distrib. Syst. 34(5), 
477-496 (2019). https://doi.org/10.1080/17445760.2018.1506453 

Kretz, M., Lindenstruth, V.: Vc: A C++ library for explicit vectorization. Softw. Practice Exp. 
42(11), 1409-1430 (2012). https://doi.org/10.1002/spe.1149 

Kronbichler, M., Kormann, K.: A generic interface for parallel cell-based finite element 
operator application. Comput. Fluids 63, 135-147 (2012). https://doi.org/10.1016/j.compfiuid. 
2012.04.012 


268 P. Bastian et al. 


40. Lengauer, C., Apel, S., Bolten, M., Chiba, S., Rüde, U., Teich, J., GréBlinger, A., Hannig, 
F., Köstler, H., Claus, L., Grebhahn, A., Groth, S., Kronawitter, S., Kuckuk, S., Rittich, H., 
Schmitt, C., Schmitt, J.: ExaStencils: Advanced multigrid solver generation. In: Software for 
Exascale Computing—SPPEXA 2nd phase. Springer, Berlin (2020) 

41. Liang, X., Di, S., Tao, D., Li, S., Li, S., Guo, H., Chen, Z., Cappello, F.: Error-controlled lossy 
compression optimized for high compression ratios of scientific datasets. In: IEEE Big Data, 
pp. 438—447 (2018). https://doi.org/10.1109/BigData.2018.8622520 

42. Lu, Y.: Paving the way for China exascale computing. CCF Trans. High Perform. Comput. 
1(2), 63-72. (2019). https://doi.org/10.1007/s42514-019-00010-y 

43. Luporini, F., Ham, D.A., Kelly, P.H.J.: An algorithm for the optimization of finite element 
integration loops. ACM Trans. Math. Softw. 44(1), 3:1-3:26 (2017). https://doi.org/10.1145/ 
3054944 

44. Message Passing Interface Forum: MPI: a message-passing interface standard version 3.1 
(2015). https://www.mpi-forum.org/docs/ 

45. Milk, R., Rave, S., Schindler, F: pyMOR—generic algorithms and interfaces for model 
order reduction. SIAM J. Sci. Comput. 38(5), $194-S216 (2016). https://doi.org/10.1137/ 
15M1026614 

46. Milk, R., Schindler, F., Leibner, T.: Extending DUNE: the dune-xt modules. Arch. Numer. 
Softw. 5(1), 193-216 (2017). https://doi.org/10.11588/ans.2017.1.27720 

47. Mohring, J., Milk, R., Ngo, A., Klein, O., Iliev, O., Ohlberger, M., Bastian: uncertainty quan- 
tification for porous media flow using multilevel Monte Carlo. In: International Conference on 
Large-Scale Scientific Computing, pp. 145-152 (2015) 

48. Miithing, S., Piatkowski, M., Bastian, P.: High-performance implementation of matrix-free 
high-order discontinuous Galerkin methods (2017). arXiv:1711.10885 

49. Ohlberger, M., Schindler, F.: Error control for the localized reduced basis multiscale method 
with adaptive on-line enrichment. SIAM J. Sci. Comput. 37(6), A2865—A2895 (2015). https:// 
doi.org/10.1137/15 1003660 

50. Orszag, S.: Spectral methods for problems in complex geometries. J. Comput. Phys. 37(1), 
70-92 (1980). https://doi.org/10.1016/0021-999 1(80)90005-4 

51. Piatkowski, M., Miithing, S., Bastian, P.: A stable and high-order accurate discontinuous 
Galerkin based splitting method for the incompressible Navier-Stokes equations. J. Comput. 
Phys. 356, 220-239 (2018). https://doi.org/10.1016/j.jcp.2017.11.035 

52. Richards, L.A.: Capillary conduction of liquids through porous mediums. Physics 1, 318-333 
(1931) 

53. Ruelmann, H., Geveler, M., Turek, S.: On the prospects of using machine learning for the 
numerical simulation of PDEs: training neural networks to assemble approximate inverses. 
Eccomas Newsletter 27-32 (2018) 

54. Teranishi, K., Heroux, M.: Toward local failure local recovery resilience model using MPI- 
ULFM. In: Proceedings of the 21st European MPI Users’ Group Meeting, EuroMPI/ASIA 
14, pp. 51:51-51:56 (2014). https://doi.org/10.1145/2642769.2642774 

55. Turek, S., Gdddeke, D., Becker, C., Buijssen, S., Wobker, S.: FEAST—tealisation of hardware- 
oriented numerics for HPC simulations with finite elements. Concurrency Comput. Pract. Exp. 
22(6), 2247-2265 (2010) 

56. Vohralik, M.: A posteriori error estimates for lowest-order mixed finite element discretizations 
of convection-diffusion-reaction equations. SIAM J. Numer. Anal. 45(4), 1570-1599 (2007). 
https://doi.org/10.1137/060653 184 

57. van der Walt, S., Colbert, S.C., Varoquaux, G.: The NumPy array: a structure for efficient 
numerical computation. Comput. Sci. Eng. 13(2), 22-30 (2011). https://doi.org/10.1109/ 
MCSE.2011.37 

58. Xu, J.: Iterative methods by space decomposition and subspace correction. SIAM Rev. 34(4), 
581-613 (1992). https://doi.org/10.1137/1034116 


Exa-Dune—Flexible PDE Solvers, Numerical Methods and Applications 269 


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons licence and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons licence, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons licence and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 
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Abstract In this paper, we present results of the second phase of the project 
ExaFSA within the priority program SPP1648—Software for Exascale Computing. 
Our task was to establish a simulation environment consisting of specialized 
highly efficient and scalable solvers for the involved physical aspects with a 
particular focus on the computationally challenging simulation of turbulent flow and 
propagation of the induced acoustic perturbations. These solvers are then coupled 
in a modular, robust, numerically efficient and fully parallel way, via the open 
source coupling library preCICE. Whereas we made a first proof of concept for 
a three-field simulation (elastic structure, surrounding turbulent acoustic flow in the 
near-field, and pure acoustic wave propagation in the far-field) in the first phase, 
we removed several scalability limits in the second phase. In particular, we present 
new contributions to (a) the initialization of communication between processes of 
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the involved independent solvers, (b) optimization of the parallelization of data 
mapping, (c) solver-specific white-box data mapping providing higher efficiency 
but less flexibility, (d) portability and scalability of the flow and acoustic solvers 
FASTEST and Ateles on vector architectures by means of code transformation, 
(e) physically correct information transfer between near-field acoustic flow and far- 
field acoustic propagation. 


1 Introduction 


The simulation of fluid-structure-acoustic interactions is a typical example for multi- 
physics simulations. Two fundamentally different physical sound sources can be 
distinguished: structural noise and flow-induced noise. As we are interested in 
accurate results for the resulting sound emissions induced from the turbulent flow, it 
is decisive to include not only the turbulent flow, but also the structure deformation 
and the interaction between both. High accuracy requires the use of highly resolved 
grids. As a consequence, the use of massively parallel supercomputers is inevitable. 
When we are interested in the sound effects far away from a flow induced 
fluttering structure, the simulation becomes too expensive, even for supercomputing 
architectures. Hence, we introduce an assumption, we call it the “far-field”. Far from 
the structure and, thus, the noise generation, we assume a homogeneous background 
flow and restrict the simulation in this part of the domain to the propagation 
of acoustic waves. This results in an overall setup with two coupling surfaces— 
between the elastic structure and the surrounding flow, and between the near-field 
and the far-field in the flow domain (see Fig. 1 for an illustrative example). Such 
a complex simulation environment implies several new challenges compared to 


FAR-FIELD Acoustics Q^ 


Coupling Interface I fa 
— 


— 


— 
Inflow FLUID FLOW QF & NEAR-FIELD Accoustics QNA 


Forces 


Fig. 1 Multiphysics fluid-structure-acoustic scenario as used in our simulations in Sect. 6. The 
domain is decomposed into a near-field ‘incompressible flow region? QF = QNA, a far-field 
‘acoustic only region’ QF^, and an ‘elastic structure region’ QS. Note that the geometry is not 
scaled correctly for better illustration 
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“single-physics” simulations: (a) multi-scale properties in space and time (small- 
scale processes around the structure, multi-scale turbulent flow in the near-field, and 
large-scale processes in the acoustic far-field), (b) different optimal discretization 
and solver choices for the three fields, (c) highly ill-conditioned problem, if 
formulated and discretized as a single large system of equations, (d) challenging 
load-balancing due to different computational load per grid unit depending on the 
local physics. 

Application examples for fluid-structure-acoustic simulations can be found in 
several technical systems: wind power plants, fans in air conditioning systems of 
buildings, cars or airplanes, car mirrors and other design details of a car frame, 
turbines, airfoil design, etc. 

Fluid-structure interaction simulations as a sub-problem of our target system 
have been in the focus of research in computational engineering for many years, 
mainly aiming at capturing stresses in the structure more realistically than with 
a pure flow simulation. A main point of discussion in this field is the question 
whether monolithic approaches—treating the coupled problem as a single large 
system of equations—or partitioned methods—glueing together separate simulation 
modules for structures and fluid flow by means of suitable coupling numerics and 
tools—are more appropriate and efficient. Monolithic approaches require a new 
implementation of the simulation code as well as the development of specialized 
iterative solvers for the ill-conditioned overall system of equations, but can achieve 
very high efficiency and accuracy [3, 12, 19, 23, 38]. Partitioned approaches, on the 
other hand, offer large flexibility in choosing optimal solvers for each field, adding 
additional fields, or exchanging solvers. The difficulty here lies in both a stable, 
accurate, and efficient coupling between independent solvers applying different 
numerical methods and in establishing efficient communication and load balancing 
between the used parallel codes. For numerical coupling, numerous efficient data 
mapping methods [5, 26, 27, 32] have been published along with efficient iterative 
solvers [2, 7, 13, 20, 29, 35, 39, 41]. In [6], various monolithic and partitioned 
approaches have been proposed and evaluated in terms of a common benchmark 
problem. Three-field fluid-structure-acoustic interaction in the literature has so far 
been restricted to near-field simulations due to the intense computational load 
(28, 33]. 

To realize a three-field fluid-structure-acoustic interaction including the far-field, 
we use a partitioned approach and couple existing established “single-physics” 
solvers in a black-box fashion. We couple the finite volume solver FASTEST [18], 
the discontinuous Galerkin solver Ateles [42], and the finite element solver CalculiX 
[14] by means of the coupling library preCICE [8]. We compare this approach to 
a less flexible white-box coupling implemented in APESmate [15] as part of the 
APES framework and make use of the common data-structure within APES [31]. 
The assumption which is confirmed in this paper is, that the white-box approach is 
more efficient, but puts some strict requirements on the codes to be coupled, while 
the black-box approach is a bit less efficient, but much more flexible with respect to 
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the codes that can be used. Our contributions to the field of fluid-structure-acoustic 
interaction, which we summarize in this paper, include: 


1. For the near-field flow, we introduce a volume coupling between background 
flow and acoustic perturbations in FASTEST accounting for the multi-scale 
properties in space and time by means of different spatial and time resolution. 

2. For both near-field flow and far-field acoustics, we achieved portability and 
performance optimization of Ateles and FASTEST for vector machines by means 
of code transformation. 

3. In terms of inter-field coupling, we 


(a) increased the efficiency of inter-code communication by means of a new 
hierarchical implementation of communication initialization and a modified 
communicator concept, 

(b) we improved the robustness and efficiency of radial basis function mapping, 

(c) we identified correct interface conditions between near-field and far-field, 
optimized the position of the interface, and ensured correct boundary 
conditions by overlapping near-field and far-field, 

(d) we developed and implemented implicit quasi-Newton coupling numerics 
that allow for a simultaneous execution of all involved solvers. 


4. For a substantially improved inter-code load balancing, we use a regression- 
based performance model for all involved solvers and perform an optimization 
of assigned cores. 

5. We present a comparison of our black-box and to the white-box approach for 
multi-physics coupling. 


These contributions have been achieved as a result of the project ExaFSA— 
a cooperation between the Technische Universität Darmstadt, the University of 
Siegen, the University of Stuttgart, and the Tohoku University (Japan) in the 
Priority Program SPP 1648—Software for Exascale Computing of the German 
Research Foundation (DFG) in close collaboration with the Technical University 
of Munich. In the first funding phase (2013-2016), we showed that efficient 
yet robust coupled simulations are feasible and can be enhanced with an in-situ 
visualization component as an additional software part, but we still reached limits 
in terms of scalability and load balancing [4, 9]. This paper focuses on results of 
the second funding phase (2016-2019) and demonstrates significant improvements 
in scalability and accuracy as well as robustness based on the above-mentioned 
contributions. 

In the following, we introduce the underlying model equations of our target 
scenarios in Sect. 2 and present our solvers and their optimization in Sect. 3 as well 
as the black-box coupling approach and new contributions in terms of coupling in 
Sect. 4. In Sect. 5, we compare black-box coupling to an alternative, efficient, but 
solver-specific and, thus, less flexible white-box coupling for uni-directional flow- 
acoustic coupling. Finally, results for a turbulent flow over a fence scenario are 
presented in Sect. 6. 
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2 Model 


In this section, we shortly introduce the underlying flow, acoustic and structure 
models of our target application. We use the Einstein summation convention 
throughout this section. 


2.1 Governing Equations 


The multi-physics scenario we investigate describes an elastic structure embedded 
in a turbulent flow field. The latter is artificially decomposed into a near-field and a 


far-field. See Fig. 1 for an example. 

Near-Field Flow In the near-field region QF = QNA, the compressible fluid flow is 
modeled by means of the density p, the velocity u; and the pressure p. As we focus 
on a low Mach number regime, we can split these variables into an incompressible 


part p, ui, p, and acoustic perturbations p’, u’,, p’: 
P=pt+o, H=ujtuji, P=ptyp'. (1) 


The incompressible flow is described by the Navier-Stokes equations! 


Ou; 
—! = 0, 2 
Oa (2a) 
a) ð 
ry (pui) + ay (ouju; — tij) = efi. (2b) 


where p is the density of the fluid, and f; summarizes external force density terms. 
The incompressible stress tensor 1; ; for a Newtonian fluid is described by 


Ou; Ou; 
Tij = —poij +H Sac Ga) (3) 


with u representing the dynamic viscosity and 4;; the Kronecker-Delta. 


'To capture the moving structure within the near-field, we actually formulate all near-field 
equations in an arbitrary Lagrian-Eulerian perspective. For the relative mesh velocity, we use 
a block-wise elliptic mesh movement as described in [30]. As we do not show fluid-structure 
interaction in this contribution, however, we formulate all near-field equations in a pure Eulerian 
perspective for the sake of simplicity. 


276 F. Lindner et al. 


Acoustic Wave Propagation The propagation of acoustic perturbations in both the 
near-field and the far-field is modeled by the linearized Euler equations, where in 
the far-field a constant background state is assumed (which implies ap = 0). 


dp’ ðu! dp! 
— — ;— =0 4 
ot to Ox; THI Ox; (4a) 
an a l (4b) 
a u;— — = 
f ot aie Ox j Ox; 
dp’ | ðu; | ap’ ap 
£ i gee 4 
pe a a (4c) 


Here c is the speed of sound. In the near-field, the background flow quantities u; 
and p are calculated from (2), whereas they are assumed to be constant in the 
acoustic far-field. The respective constant value is read from the coupling interface 
with the near-field, which implies that the interface has to be chosen such that the 
background flow values are (almost) constant at the coupling interface. In both 
cases, the coupling between background-flow and acoustic perturbations is uni- 
directional from the background flow to the acoustic equations (4) by means of 
p and ui. 


Elastic Structure The structural subdomain QS is governed by the equations of 
motion, here in Lagrangian description: 


5078; _ OS jk Fik + pfs (5) 
a “axs xs P Ji- 

With X =X > + v; being the position of a particle in the current configuration, X > 

is the position of a particle in the reference configuration, and Ŷ; the displacement. 

F;; is the deformation gradient. S;; is the second Piola-Kirchhoff tensor, and p> 

describes the structural density. The Cauchy stress tensor T relates to S;j via 


1 
s 
tS = ———_ F Sy Fi . 6 
i Get(Fy) ik Ski Fji (6) 


We assume linear elasticity to describe the stress-strain relation. 

The coupling between fluid and structure is bi-directional by means of dynamic 
and kinematic conditions, i.e., equality of interface displacements/velocities and 
stresses, 1.e., 


rs 
rE av; rF _ps 
u = a’ Tij = Tj (7) 


at T! = rS ATF with FF = aQF and TS = an’. 
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3 Solvers and Their Optimization 


Following a partitioned approach, the respective subdomains of the multi-physics 
model as described in Sect. 2 (elastic structure domain, near-field, and far-field) 
are treated by different solvers. We employ the flow solver FASTEST presented 
in Sect. 3.1 to solve for the incompressible flow equations, Eq. (2), and near-field 
acoustics equations, Eq. (4), the Ateles solver described in Sect. 3.2 for the far-field 
acoustics equations, Eq. (4), and finally the structural solver CalculiX introduced in 
Sect. 3.3 for the deformation of the obstacle, Eq. (5). For performance optimization 
of FASTEST and Ateles, we make use of the Xevolver framework, which has 
been developed to separate system-specific performance concerns from application 
codes. We report on the optimization of both solvers further below. 


3.1 FASTEST 


FASTEST is used to solve both the incompressible Navier-Stokes equations (2) and 
the linearized Euler equations (4) in the near-field. 


Capabilities and Numerical Methods The flow solver FASTEST [24] solves 
the three-dimensional incompressible Navier-Stokes equations. The equations are 
discretized utilizing a second-order finite-volume approach with implicit time- 
stepping, which is also second order accurate. Field data are evaluated on a 
non-staggered, body-fitted, and block-structured grid. The equations are solved 
according to the SIMPLE scheme [11], and the resulting linear equation system 
is solved by ILU factorization [36]. Geometrical multi-grid is employed for con- 
vergence acceleration. The code generally follows a hybrid parallelization strategy 
employing MPI and OpenMP. FASTEST can account for different flow phenomena, 
and has the capability to model turbulent flow with different approaches. In our test 
case example, we employ a detached-eddy simulation (DES) based on the ¢ — f 
turbulence model [30]. 

In addition, FASTEST contains a module to solve the linearized Euler equations 
to describe low Mach number aeroacoustic scenarios, which are solved by a second 
order Lax-Wendroff scheme with various limiters. 

Since all equation sets are discretized on the same numerical grid, advantage can 
be taken from the multi-grid capabilities to account for the scale discrepancies of the 
fluid flow and the acoustics. Since the spatial scales of the acoustics are considerably 
larger than those of the flow, a coarser grid level can be used for them. In return, the 
finer temporal scales can be considered by sub-cycling a CFD time step with various 
CAA time steps. This way a very efficient implementation of the viscous/acoustic 
splitting approach can be realized. 
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Performance Optimization Concerning performance optimization, one interest- 
ing point of FASTEST is that some of its kernels were once optimized for old vector 
machines, and thus important kernels have their vector versions in addition to the 
default ones. The main difference between the two versions is that nested loops in 
the default version are collapsed into one loop in the vector version. Since the loops 
skip accessing halo regions, the compiler is not able to automatically collapse the 
loops, resulting in short vector lengths even if the compiler can vectorize them. To 
efficiently run the solver on a vector system, performance engineers usually need to 
manually change the loop structures. In this project, Xevolver is used to express 
the differences between the vector and default versions as code transformation 
rules. In other words, vectorization-aware loop optimizations are expressed as code 
transformations. As a result, the default version can be transformed to its vector 
version, and the vector version does not need to be maintained any longer to 
achieve high performance on vector systems. That is, the FASTEST code can be 
simplified without reducing the vector performance by using the Xevolver approach. 
Ten rules are defined to transform the default kernels in FASTEST to their vector 
kernels. Those code transformations plus some system-independent minor code 
modifications for removing vectorization obstacles can reduce the execution time 
on the NEC SX-ACE vector system by about 85%, when executing a simple test 
case that models a three-dimensional Poiseuille flow through a channel based on the 
Navier-Stokes equations, in which the mesh contains two blocks with 426,000 cells 
each. The code execution on the SX-ACE vector processor works about 2.7 times 
faster than on the Xeon E5-2695v2 processor, since the kernel is memory-intensive 
and the memory bandwidth of SX-ACE is 4x higher than that of Xeon. Therefore, it 
is clearly demonstrated that the Xevolver approach is effective to achieve both high 
performance portability and high code maintainability for FASTEST. 


3.2 Ateles 


In our project, Ateles is used for the simulation of the acoustic far-field. Since 
acoustics scales need to be transported over a large distance, Ateles’ high-order DG 
scheme can show its particular advantages of low dissipation and dispersion error in 
this test case. 


Capabilities and Numerical Methods The solver Ateles is integrated in the 
simulation framework APES [31]. Ateles is based on the Discontinuous Galerkin 
(DG) discretization method, which can be seen as a hybrid method, combining 
the finite-volume and finite-element methods. DG is well suited for parallelization 
and the simulation of aero-acoustic problems, due to its inherent dissipation and 
dispersion properties. This method has several outstanding advantages, that are 
among others the high-order accuracy, the faster convergence of the solution with 
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increasing scheme order and fewer elements compared to a low order scheme 
with a higher number of elements, the local h-p refinement as well as orthogonal 
hierarchical bases. The DG solver Ateles includes different equation systems, 
among others the compressible Navier-Stokes equations, the compressible inviscid 
Euler equations and the linearized Euler equations (used in this work for the 
acoustics far-field). For the time discretization, Ateles makes use of the explicit 
Runge-Kutta time stepping scheme, which can be either second or fourth order. 


Performance Optimization Analyzing the performance of Ateles, originally 
developed assuming x86 systems, we found out that four kinds of code optimization 
techniques are needed for a total of 18 locations of the code in order to migrate 
the code to the SX-ACE system. Those techniques are mostly for collapsing the 
kernel loop and also for directing the NEC compiler to vectorize the loop. In this 
project, all the techniques are expressed as one common code transformation rule. 
The rule can take the option to change its transformation behaviors appropriately 
for each code location. This means that, to achieve performance portability between 
SX-ACE and x86 systems, only one rule needs to be maintained in addition to the 
Ateles application code. We executed a small testcase solving Maxwell equations 
with an 8th order DG scheme on 64 grid cells. The code transformation leads to 
7.5x higher performance. The significant performance improvement is attributed 
to loop collapse and insertion of appropriate compiler directives, which increases 
the vectorization length by a factor of 2 and the vectorization ratio from 71.35% 
to 96.72%. Finally, in terms of the execution time, the SX-ACE performance is 
19% the performance of Xeon E5-2695v2. The code optimizations for SX-ACE 
reduce the performances of Xeon and Power8 by 14% and 6%, respectively. In this 
way, code optimizations for a specific system are often harmful to other systems. 
However, by using Xevolver, such a system-specific code optimization is expressed 
separately from the application code. Therefore, the Xevolver approach is obviously 
useful for achieving high performance portability across various systems without 
complicating the application code. 


3.3 CalculiX 


As structure solver, we use the well-established finite element solver CalculiX[14], 
developed by Guido Dhont und Klaus Wittig. While CalculiX also supports static 
and thermal analysis, we only use it for dynamic non-linear structural mechanics. 
As our main research focus is not the structural computation per se, but the coupling 
within a fluid-structure-acoustic framework, we merely regard CalculiX as a black 
box. The preCICE adapter of CalculiX has been developed in [40]. 


2www.calculix.de. 


280 F. Lindner et al. 


4 A Black-Box Partitioned Coupling Approach Using 
preCICE 


Our first and general coupling approach for the three-field simulation comprising 
(a) the elastic structure, (b) the near-field flow with acoustic equations, and (c) the 
far-field acoustic propagation follows a black-box idea, i.e., we only use input 
and output data of dedicated solvers at the interfaces between the respective 
domains for numerical coupling. Such a black-box coupling requires three main 
functional coupling components: intercode-communication, data-mapping between 
non-matching grids of independent solvers, and iterative coupling in cases with 
strong bi-directional coupling. preCICE is an open source library’ that provides 
software modules for all three components. In the first phase of the ExaFSA project, 
we ported preCICE from a server-based to a fully peer-to-peer communication 
architecture [9, 39], increasing the scalability of the software from moderately to 
massively parallel. To this end, all coupling numerics needed to be parallelized on 
distributed data. During the second phase of the ExaFSA project, we focused on 
several costly initialization steps and further necessary algorithmic optimizations. 
In the following, we shortly sketch all components of preCICE with a particular 
focus on innovations introduced in the second phase of the ExaFSA project and on 
the actual realization of the fluid-acoustic coupling between near-field and far-field 
and the fluid-structure coupling. 


4.1 (Iterative) Coupling 


To simulate fluid-structure-acoustic interactions such as in the scenario shown in 
Fig. 1, two coupling interfaces have to be considered with different numerical and 
physical properties: (a) the coupling between fluid flow and the elastic structure 
requires an implicit bi-directional coupling, i.e., we exchange data in both directions 
and iterate in each time step until convergence; (b) the coupling between fluid flow 
and the acoustic far-field is uni-directional (neglecting reflections back into the near- 
field domain), i.e., results of the near-field fluid flow simulation are propagated to the 
far-field solver as boundary values once per time step. In order to fulfil the coupling 
conditions at the fluid-structure interface as given in Sect. 2, we iteratively solve the 


fixed-point equation 
S(f)\ _ (4 
(rw) = (5): = 


3www.precice.org. 


ExaFSA: Parallel Fluid-Structure-Acoustic Simulation 281 


where f represents the stresses, u the velocities at the interface rs, S the effects 
of the structure solver on the interface (with stresses as an input and velocities as an 
output), F the effects of the fluid solver on the interface (with interface velocities as 
an input and stresses as an output). preCICE provides a choice of iterative methods 
accelerating the plain fixed-point iteration on Eq. (8). The most efficient and robust 
schemes are our quasi-Newton methods that are provided in a linear complexity (in 
terms of interface degrees of freedom) and fully parallel optimized versions [35]. 
As most of our achievements concerning iterative methods fall within the first phase 
of the ExaFSA project, we omit a more detailed description and refer to previous 
reports instead [9]. 

For the uni-directional coupling between the fluid flow in the near-field and the 
acoustic far-field, we transfer perturbation in density, pressure, and velocity from 
the flow domain to the far-field as boundary conditions at the interface. We do this 
once per acoustic time step, which is chosen to be the same for near-field and far- 
field acoustics, but which is much smaller than the fluid time step size (and the 
fluid-structure coupling), as described in Sect. 3.1. 

Both domains are time-dependent and subject to mutual influence. 

In an aeroacoustic setting, the near-field subdomain QNA and far-field subdomain 
QFA, with boundaries PN“ = aQNA and FFA = aQFA are fixed, which means, all 
background information in the far-field are fixed to a certain value. Therefore there 
is only influence of QNA onto QF^, as backward propagation can be neglected. Then 
the continuity of shared state variables on the interface boundary TA = rN4 n PFA 
is 

PFA PNA PFA NA DFA NA 


Pi = Pi Ui; =U; Pi = Pi 3 (9) 


4.2 Data Mapping 


Our three solvers use different meshes adapted to their specific problem domain. 
To map data between the meshes, preCICE offers three different interpolation 
algorithms: (a) Nearest-neighbor interpolation is based on finding the geometrically 
nearest neighbor, i.e. the vertex with the shortest distance from the target or 
source vertex. It excels in its ease of implementation, perfect parallelizability, and 
low memory consumption. (b) Nearest-projection mapping can be regarded as an 
extension to the nearest-neighbor interpolation, working on nearest mesh elements 
(such as edges, triangles or quads) instead of merely vertices and interpolating 
values to the projection points. The method requires a suitable triangulation to 
be provided by the solver. (c) Interpolation by radial-basis functions is provided. 
This method works purely on vertex data and is a flexible choice for arbitrary mesh 
combinations with overlaps and gaps alike. 

In the second phase of the ExaFSA project, we improved the performance of the 
data mapping schemes in various ways. All three interpolation algorithms contain 
a lookup-phase which searches for vertices or mesh elements near a given set of 
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positions. As there is no guarantee regarding ordering of vertices, this resulted in 
O(n-m) lookup operations, n,m € N being the size of the respective meshes. In 
the second phase, we introduced a tree-based data structure to facilitate efficient 
spatial queries. The implementation utilizes the library Boost Geometry* and 
uses an rtree in conjunction with the r- star insertion algorithm. The integration 
of the tree is designed to fit seamlessly into preCICE and avoids expensive copy 
operations for vertices and mesh elements of higher dimensionality. Consequently, 
the complexity of the lookup-phase was reduced to O (log, n) -m with a being a 
parameter of the tree, set to ~5. The tree index is used by nearest-neighbor, nearest- 
projection, and RBF interpolation as well as other parts in preCICE and provides a 
tremendous speedup in the initialization phase of the simulation. 

In the course of integrating the index, the RBF interpolation profited from a 
second performance improvement. In contrast to the nearest-neighbor and nearest- 
projection schemes it creates an explicit interpolation matrix. Setting values one by 
one results in a large number of small memory allocations with a relatively large per- 
call overhead. To remedy this, a preallocation pattern is computed with the help of 
the tree index. This results in a single memory allocation, speeding up the process 
of filling the matrix. A comparison of the accuracy and runtime of the latter two 
interpolation methods is provided in Sect. 5. 


4.3 Communication 


Smart and efficient communication is paramount in a partitioned multi-physics 
scenario. As preCICE is targeted at HPC systems, a central communication instance 
would constitute a bottleneck and has to be avoided. At the end of phase one, 
we implemented a distributed application architecture. The main objective in its 
design is not a classical speed-up (as it is for parallelism) but not to deteriorate 
the scalability of the solvers and rendering a central instance unnecessary. Still, 
a so-called master process exists, which has a special purpose mainly during the 
initialization phase. 

At initialization time, each solver gives its local portion of the interface mesh to 
preCICE. By a process called re-partitioning, the mesh is transferred to the coupling 
partner and partitioned there, i.e., the coupling partner’s processes select interface 
data portions that are relevant for their own calculations. The partitioning pattern is 
determined by the requirements of the selected mapping scheme. The outcome of 
this process is a sparse communication graph, where only links between participants 
exist that share a common portion of the interface. While this process was basically 
in place at the end of phase one, it was refined in several ways. 

MPI connections are managed by means of a communicator which represents an 
n-to-m connection including an arbitrary number of participants. The first imple- 


4www.boost.org. 
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mentation used only one communication partner per communicator, essentially 
creating only 1-to-1 connections. To establish the connections, every connected 
pair of ranks had to exchange a connection token generated by the accepting side. 
This exchange is performed using the network file system, as the only a-priori 
existing communication space common to both participants. However, network file 
systems tend to perform badly with many files written to a single directory. To 
reduce the load on the file system, a hash-based scheme was introduced as part of 
the optimizations in phase two. With that, writing of the files is distributed among 
several directories, as presented in [26]. This scheme features a uniform distribution 
of files over different directories and, thus, minimizes the files per directory. 

However, this obviously resulted in a large number of communicators to be 
created. As a consequence, large runs hit system limits regarding the number of 
communicators. Therefore, a new MPI communication scheme was created as an 
alternative. It uses only one communicator for an all-to-all communication, resulting 
in significant performance improvements for the generation of the connections. 
This approach also solves the problem of the high number of connection tokens 
to be published, though only for MPI. As MPI is not always available or the 
implementation is lacking, the hash-based scheme of publishing connection tokens 
is still required for TCP based connections. 


4.4 Load Balancing 


In a partitioned coupled simulation solvers need to exchange boundary data at the 
beginning of each iteration, which implies a synchronization point. If computational 
cores are not distributed in an optimal way among solvers, one solver will have to 
wait for the other one to finish its time step. Thus, the load imbalance reduces the 
computational performance. In addition, in a one way coupling scenario, if the data 
receiving solver is much slower than the other one, the sending partner has to wait 
until the other one is ready to receive (in synchronized communication) or store the 
data in a buffer (in asynchronous communication). In the first phase, the distribution 
of cores over solvers was adjusted manually and only synchronized communication 
was implemented, resulting in idle times. 


Regression Based Load Balancing We use the load balancing approach proposed 
in [37] to find the optimal core distribution among solvers: we first model the solver 
performance against the number of cores for each domain and then optimize the 
core distribution to minimize the waiting time. Since mathematical modeling of the 
solvers’ performance can be very complicated, we use an empirical approach as 
proposed in [37], first introduced in [10], to find an appropriate model. 

Assuming we have a given set of m data points, consisting of pairs (p, fp) 
mapping the number of ranks p to the run-time fp, we want to find a function f (p) 
which predicts the run-time against p. Therefore, we use the Performance Model 
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Normal Form (PMNF) [10] as a basis for our prediction model: 
n . 
f'(p) = > crp” log} (p), (10) 
k=1 


where the superscript i denotes the respective solver, n is a a-priori chosen number 
of terms, ix, jx € No and cx is the coefficient for the kth regression term. The next 
step is to optimize the core distribution such that we achieve minimal overall run 
time which can be expressed by the following optimization problem: 


a i F(pi,.--, pi) with F(pı,... pi) = max( f! (pi)) 
T i 


l 
subject to ‘> Di <P. 


i=l 


This optimization problem is a nonlinear, possibly non-convex integer program. 
It can be solved by the use of branch and bound techniques. But, if we assume that 
the f! are all monotonically decreasing, i.e., assigning more cores to a solver never 
increases the run-time, we can simplify the constraints to P = sy pi and solve 
the problem by brute-forcing all possible choices for p;. That is, we iterate over all 
possible combinations of core numbers and choose the pair that minimizes the total 
run-time. For more details, please refer to [37]. 


Asynchronous Communication and Buffering For our fluid-structure-acoustic 
scenario shown in Fig. 1, we perform an implicitly coupled simulation of the elastic 
structure interacting with the incompressible flow over a given discrete time step 
(marked simply as ‘Fluid’ in Fig. 2). This is followed by many small time steps for 
the acoustic wave propagation in the near-field, which are coupled in a loose, uni- 
directional way to the far-field acoustic solver (executing the same small time steps). 
To avoid waiting times of the far-field solver while we compute the fluid-structure 
interactions in the near-field, we would like to ‘stretch’ the far-field calculations 
such that they consume the same time as the sum of fluid-structure time steps and 
acoustic steps in the near-field (see Fig. 2). To achieve this, we introduced a fully 
asynchronous buffer layer, by which the sending participant was decoupled from 
the receiving participant, as shown in Fig. 2. Special challenges to tackle were the 
preservation of the correct ordering of messages, especially for TCP communication 
which does not implement such guarantees in the protocol. 


4.5 Isolated Performance of preCICE 


In this section, we show numerical results for preCICE only. This isolated approach 
is used to show the efficiency of the communication initialization. In addition, 


ExaFSA: Parallel Fluid-Structure-Acoustic Simulation 285 


_ tg _ tg 
A Fluid NFA | Fluid NFA 
Buffer 
ETENEL 
B FFA FFA 
lto ' ts 


Runtime t 


> 


Fig. 2 Coupling scenario between participant A (performing a time step for the incompressible 
fluid (or fluid-structure interaction) followed by many time steps of the near-field acoustic 
simulation (NFA)) and participant B (performing the same small acoustic steps for the far-field 
(FFA) after receiving acoustic data from the near-field solver). Without buffering, inevitable idle 
times for participant B are created. NFA is linked to FFA through send operations. Therefore, the 
runtimes of NFA and FFA are matched through careful load-balancing. Shown here: A send buffer 
decouples NFA and FFA solver for send operations, prevents idle times, and allows for a more 
flexible processor assignment 


we show stand-alone upscaling results. Other aspects are considered elsewhere: 
(a) the mapping accuracy is analyzed in Sect. 5, (b) the effectiveness of our load 
balancing approach as well as the buffering for uni-directional coupling are covered 
in Sect. 6. If not denoted otherwise, the following measurements are performed on 
the supercomputing systems SuperMUC? and HazelHen.° 


Mapping Initialization: Preallocation and Matrix Filling As described previ- 
ously, one of the key components of mapping initialization is the spatial tree 
which allows for performance improvements by accelerating the interpolation 
matrix construction. Figure 3 compares different approaches to matrix filling and 
preallocation: (a) no preallocation: using no preallocation at all, i.e., allocating each 
entry separately, (b) explicitly computed: calculate matrix sparsity pattern in a first 
mesh traversal, allocate entries afterwards, and finally fill the matrix in a second 
mesh traversal, (c) computed and saved: additionally cache mesh element/data point 
relations from the first mesh traversal and use them in the second traversal to fill 
the matrix with less computation, (d) spatial tree: use the spatial tree instead of 
brute-force pairwaise comparisons to determine mesh components relevant for the 


528x Intel-Xeon-E5-2697 cores, 64 GB memory per node. 
624x Intel Xeon-E5-2680 cores, 128 GB memory per node. 
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Fig. 3 Mapping initialization. Comparison of different preallocations methods for mesh sizes 
6400 (left sub-figure) and 10,000 (right sub-figure) on two ranks per participant. The plot compares 
times spent in the stages of preallocation and filling of matrices for both the evaluation matrix and 
the interpolation matrix of an RBF mapping with localized Gaussian basis functions including 6 
vertices of the mesh. The total time required is the sum of all bars of one measurement. Note the 
logarithmic scaling of the y-axis. The measurements were performed on one node of the sgscl1 
cluster, using 4x Intel Xeon E3-1585 CPUs 


mapping. Each method can be considered as an enhancement of the previous one. 
As it becomes obvious from Fig. 3, the spatial tree was able to provide us a with an 
acceleration of more than two orders of magnitude. 


Communication For communication and its initialization, we only present results 
for the new single-communicator MPI based solution. For TCP socket communica- 
tion that still requires the exchange of many connection tokens by means of the file 
system, we only give a rough factor of 2.5 that we observed in terms of acceleration 
of communication initialization. Note that this factor can be potentially higher as 
the number of processes and, thus, connections grows larger, and that the hash- 
based approach removed the hard limit of ranks per participant inherent to the old 
approach. 

In Figs.4, 5 and 6, we compare performance results for establishing an MPI 
connection among different ranks using many-communicators for 1-to-1 connec- 
tions with using a single communicator representing an n-to-m connection. In our 
academic setting, both Artificial Solver Testing Environment (ASTE) participants 
run on n cores. On SuperMUC, each rank connects to 0.47 ranks, on HazelHen, with 
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Fig. 4 Communication. Publishing of MPI connection information from participant A for the 
many-communicator approach. The timings of the new single-communicator approach are not 
shown, as they are almost negligible with a maximum of 2 ms 
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Fig. 5 Communication. Runtime for establishing the connection between the participants using 
MPI_Comm_accept andMPI_Comm_connect 


a higher number of ranks per node, each rank connects to 0.3m ranks. The amount 
of data transferred between each connected pair of ranks is held constant with 1000 
rounds of transfer of an array of 500, and 4000 double values from participant B 
to participant A. Each measurement is performed five times of which the fastest 
and the slowest runs are ignored and the remaining three are averaged. We present 
timings from rank zero, which is synchronized with all other ranks by a barrier, 
making the measurements from each rank identical. Note, that the measurements 
are not directly comparable between SuperMUC and HazelHen due to the different 
number of cores per node and that the test case is even more challenging than actual 
coupled simulations. In an actual simulation, the number of partner ranks per rank 
of a participant is constant with increasing number of cores on both sides. 

Figure 4 shows the time to publish the connection token. The old approach 
requires to publish many tokens, which obviously becomes a performance bottle- 
neck as the simulation setup moves to higher number of ranks. The new approach, 
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Fig. 6 Communication. Times for 1000 rounds of data transfer of a vector of 500 or 4000 doubles, 
respectively, from participant B to A. For the transfer, the synchronous MPI routines (MPI_ Send 
and MPI_Recv) have been used 


on the other hand, only publishes one token. It is omitted in the plot, as the times are 
negligible (<2 ms). In Fig. 5, the time for the actual creation of the communicator 
is presented. The total number of communication partners per communicator is 
smaller with the old many-communicator concept (as the communication topology 
is sparse). However, the creation of many |-to-1 communicators is substantially 
slower than the creation of one all-to-all communicator for both HPC systems. 
Finally, in Fig. 6 the performance for an exchange of data sets of two different 
sizes is presented. The results for single- and many-communicator approaches are 
mostly on par with the notable exception of the SuperMUC system. There, the new 
approach suffers a small but systematic slow-down for small message sizes. We 
argue that this is a result of vendor specific settings of the MPI implementation. 


Data Mapping As described above, we have further improved the mapping 
initialization, in particular by applying a tree-based approach to identify data 
dependencies induced by the mapping between grid points of the non-matching 
solver grids and to assemble the interpolation matrix for RBF mapping. Accord- 
ingly, we show both the reduction of the matrix assembly runtime (Fig. 3) and the 
scalability of the mapping, including setting up the interpolation system and the 
communication initialization. 

These performance tests of preCICE are measured using a special testing 
application called ASTE.’ This application behaves like a solver to preCICE but 
provides artificial data. It is used to quickly generate input data and decompose it 
for upscaling tests. ASTE generates uniform, rectangular, two-dimensional meshes 
on [0, 1] x [0, 1] embedded in three-dimensional space with the z-dimension always 


Thttps://github.com/precice/aste. 
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Fig. 7 Data mapping initialization. Strong scaling of the initialization of the interpolation matrix 
C and the evaluation matrix A for RBF interpolation on mesh sizes 12007 and 2000? with Gaussian 
(m = 6) basis functions on up to 4224 processors on the HazelHen HPC system 


set to zero. The mesh is then decomposed using a uniform approach, thus producing 
partitions of same size as far as possible. Since we mainly look at the mapping part 
which is only executed as one of the participants, we limit the upscaling to this 
participant. The other participant always uses one node (28 resp. 24 processors). 
The mesh size is kept constant, i.e., we perform a strong scaling. The upscaling of 
an RBF mapping with Gaussian basis functions is shown in Fig. 7. 


5 Black-Box Coupling Versus White-Box Coupling 
with APESMate 


In the above section, we have evaluated the performance of the black-box coupling 
tool preCICE. In this section, we introduce an alternative approach that allows 
to couple different solvers provided within the framework APES [31]. Black- 
box data mapping in preCICE only requires point values (nearest neighbor and 
RBF mapping), and in some cases (nearest projection) connectivity information on 
the coupling interface. The white-box coupling approach of APESmate [25] has 
knowledge about the numerical schemes within the domain, since it is integrated 
in the APES suite, and has access to the common data-structure TreELM [22]. 
APESmate can directly evaluate the high order polynomials of the underlying 
Discontinuous Galerkin scheme. Thus, the mapping in preCICE is more generally 
applicable, while the approach in APESmate is more efficient in the context of 
high order scheme. Furthermore, APESmate allows the coupling of all solvers of 
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the APES framework, both in terms of surface and in terms of volume coupling. 
The communication between solvers can be done in a straightforward way as all 
coupling participants can be compiled as modules into one single application. Each 
subdomain defines its own MPI sub-communicator, a global communicator is used 
for the communication between the subdomains. During the initialization process, 
coupling requests are locally gathered from all subdomains and exchanged in a 
round-robin fashion. As all solvers in APES are based on an octree data-structure 
and a space-filling curve for partitioning, it is rather easy to get information about 
the location of each coupling point on the involved MPI ranks. In the following, we 
compare both accuracy and runtime of the two coupling approaches for a simple 
academic test case that allows to control the ‘difficulty’ of the mapping by adjusting 
order and resolution of the two participants. 


Test Case Setup We consider the spreading of a Gaussian pressure pulse over a 
cubic domain of size 5 x 5 x 5 unit length, with an ambient pressure of 100,000 Pa 
and a density of 1.0kg/m?. The velocity vector is set to 0.0 for all spatial directions. 
To generate a reference solution, this test case is computed monolithically using the 
inviscid Euler equations. 

For the coupled simulations, we decompose the monolithic test case domain into 
an inner and an outer domain. The resolution and the discretization order of the 
inner domain are kept unchanged. In the outer domain, we choose the resolution 
and the order such that the error is balanced with that of the inner domain. See 
[15] for the respective convergence study. To be able to determine the mapping 
error at the coupling interface between inner and outer domain, we choose the time 
horizon such that the pressure pulse reaches the outer domain, but is still away from 
the outer boundaries to avoid any influences from the outer boundaries. The test 
case is chosen in a way, that the differences between the meshes at the coupling 
interface increase, thus increasing the difficulty to maintain the overall accuracy in 
a black-box coupling approach. Table | provides an overview of all combinations of 
resolution and order in the outer domain used for our numerical experiments, where 
the total number of elements per subdomain is given as nElements, the number 
of coupling points with nCoupling points and the scheme order by nScheme order, 
respectively. For time discretization, we consider the explicit two stage Runge-Kutta 
scheme with a time step size of 1076 for all simulations. 


Table 1 White-box coupling test scenario with Gaussian pressure pulse combinations of orders 
and resolution used for the evaluation of the mapping methods 


Test case a Test case b Test case c Test case d 
Inner Outer Inner Outer | Inner Outer | Inner Outer 
nElements 32,768 | 124,000 | 32,768 | 7936 | 32,768 992 | 32,768 124 


nCoupling points | 55,296 9600 | 55,296 | 3456 | 55,296 | 1536 | 55,296 | 1176 
nScheme order 3 4 3 6 3 8 3 14 
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Fig. 8 Mapping accuracy black-box (APESmate) versus white-box (preCICE) approach. Com- 
parison of the L2 error (with the analytical solution as reference) for the Gaussian pressure 
pulse test case variants and exemplary illustration of the coupling point distribution when using 
DG for test case (right figure). We compare the black-box data mapping methods Radial-Basis 
Functions (RBF) interpolation and Nearest Projection (NP) with the direct white-box evaluation of 
APESmate. The RBF mapping uses local Gaussian basis functions covering three mesh points in 
every direction 


Mapping Accuracy In terms of mapping accuracy, it is expected, that the APES- 
mate coupling is order-preserving, and by that not (much) affected by the increasing 
differences between the non-matching grids at the coupling interface, while pre- 
CICE should show an increasing accuracy drop when the points become less and 
less matching. This is the case for increasing order of the discretization in the 
outer domain. Figure 8 illustrates first results. As can be clearly seen, the white- 
box coupling approach APESmate provides outstanding results by maintaining 
the overall accuracy of the monolithic solution for all different variations of the 
coupled simulations, independent of the degree to which the grids are non-matching 
(increasing with increasing order used in the outer domain). For the interpolation 
methods provided by preCICE, the error increases considerably with increasing 
differences between the grids at the interface. As the error of the interpolation 
methods depends on the distances of the points (see Fig. 8), the error is dominated 
by the large distance of the integration points in the middle of the surface of an 
octree grid cell in the High Order Discontinuous Galerkin discretization. 


Accuracy Improvement by Regular Subsampling We can decrease the L2 error 
of NP and RBF and improve the solution of the coupled simulation by providing 
values at equidistant points on the Ateles side as interpolation support points. The 
number of equidistant points is equal to the number of coupling points, hence 
as high as the scheme order. With this new implementation, the error shown in 
Fig. 9a decreases considerably compared to the results in Fig. 8. We achieve an 
acceptable accuracy for all discretization order combinations. However, the regular 
subsampling of values in the Ateles solver increases the overall computational time 
substantially as can be seen in Fig. 9b. 
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Fig. 9 Mapping accuracy and runtime of black-box with equidistant subsampling versus white- 
box approach. L2 error behavior (a) and computational time (b) for the RBF interpolation, when 
using equidistant (RBFgg) and non-equidistant (RBF) point distributions for data mapping 


To improve the NP interpolation, it turned out that in addition to providing 
equidistant points, oversampling was required to increase the accuracy. Our inves- 
tigation showed, that an oversampling factor of 3 is needed to achieve almost the 
same accuracy as APESmate. In spite of the additional cost of many newly generated 
support points, the runtime does not increase as much as for RBF, since for the RBF 
a linear equation system has to be computed, while for NP a simple projection needs 
to be done. 


Summary and Runtime Comparison Figure 10 shows a summary of all tested 
methods for the interpolation/evaluation before and after improvements. The inte- 
grated coupling approach APESmate provides not just very accurate results, but also 
low runtimes. At this point, we want to recall that this is as expected—the white- 
box approach makes use of all internal knowledge, which gives it advantages in 
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Fig. 10 Mapping accuracy and runtime summary of black-box versus white-box approach. L2 
error and computational time for all methods 
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terms of accuracy and efficiency. On the other hand, this internal knowledge binds 
it to the solvers available in the framework, while preCICE can be applied to almost 
all available solvers. Further details regarding this investigation can be found in 
[15, 16]. 


6 Results 


This section presents a more realistic test case, the turbulent flow over a fence, to 
assess the overall performance of our approach. Analyses for accuracy and specific 
isolated aspects are integrated in the sections above. 


6.1 Flow over a Fence Test Case Setup 


As a test case to assess the overall scalability, we simulate the turbulent flow over 
a (flexible) fence and the induced acoustic far-field as already shown schematically 
in Fig. 1. The FSI functionality of FASTEST has been demonstrated earlier many 
times, e.g. in [34]. Thus we focus on the acoustic coupling. 

As boundary conditions, we use a no-slip wall at the bottom and the fence 
surface, an inflow on the left with wp x, outflow convective boundary conditions 
on the right, periodic boundary conditions in y-directions, and slip conditions at 
the upper boundary for the near-field flow. For the acoustic perturbation, we apply 
reflection conditions at the bottom and the fence surface, zero-gradient condition at 
all other boundaries. The acoustic far-field solver uses Dirichlet boundary conditions 
at its lower boundary (see also Eq. (9)). Therefore, the upper near-field boundary is 
not the coupling interface, but we instead overlap near-field and far-field as shown in 
Fig. 11. Figures 12 and 13 show a snapshot of the near-field flow and the near-field 
and far-field acoustic pressure, respectively. 


6.2 Fluid-Acoustics Coupling with FASTEST and Ateles 


To demonstrate the computational performance of our framework using FASTEST 
for the flow simulation in the near-field, the high-order DG solver Ateles for the 
far-field acoustic wave propagation, and preCICE for coupling, we show weak 
scalability measurements for the interaction between near-field flow simulation and 
far-field acoustics. We keep both the mesh and the number of MPI ranks in the 
near-field flow simulation fixed. In the far-field computed with Ateles, we refine the 
mesh to better capture the acoustic wave propagation. We use a multi-level mesh 
with a fine mesh at the coupling interface to allow a smooth solution at the coupling 
interface between the near-field and the far-field. We refine the far-field mesh in 
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Fig. 11 Flow over a fence test case. Schematic view of the computational domain (left) and 
applied parameters (right). Colors indicate spatial discretization order in the various regions: The 
FV domain is completely second order, while the DG domain has second order at the coupling 
interface for reduced coupling interpolation errors, and subsequently increases the order in various 
layers for the far-field transport 


Fig. 12 Flow over a fence test case with FASTEST. Snapshot of the flow in the recirculation area 
behind the fence. Red/blue indicate acoustic pressure, grey shades show the modelled turbulent 
kinetic energy (for a ¢ — f DES model) 


two main steps: in the first step, we only refine the mesh at the coupling interface. 
In the next step, we first refine the whole mesh, and again the mesh at the coupling 
interface in the third and fourth step. Due to the refinement at the coupling interface, 
the number of Ateles ranks participating in the interface increases such that this 
study also shows that the preCICE communication does not deteriorate scalability. 
Table 2 gives an overview of the configurations used for the weak scaling study. 

To find the optimal core distribution for all setups, the load balancing approach 
proposed in Sect. 4 is used. This analysis shows that for the smallest mesh resolution 
with 24,864 elements in the far-field, the optimal core distribution is 424 cores for 
the near-field domain and 196 cores for the far-field. For all other setups, we assume 
perfect scalability, i.e., we choose the number of cores proportional to the number 
of degrees of freedom in the weak scaling study and increase the number of cores 
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Fig. 13 Flow over a fence test case. Snapshot of the acoustic pressure in a coupled simplified 
setup 


Table 2 Flow over a fence test case 


Cores in FASTEST/Ateles Ateles elements 
424/196 24,864 
424/756 89,376 
424/1428 177,408 
424/3136 607,488 


424/15,372 1,245,054,720 3,704,064 


Scalability study for the interaction between the near-field flow simulation and the far-field 
acoustics: Summary of mesh details and core numbers for weak scaling. In the FASTEST 
simulation of the near-field flow simulation, we use 52,822,016 elements. In the far-field, Ateles 
uses discretization order 9 


simultaneously by a factor of two in both fields for strong scaling. The scalability 
measurements are shown in Fig. 14. The results show that the framework scales 
almost perfectly up to 6528 cores. 


6.3 Fluid-Acoustics Coupling with Only Ateles 


In Sect. 5 we investigated the suitability of different interpolation methods for our 
simulations. In this section, we present a strong scaling study for an Ateles-Ateles 
coupled simulation of the flow over a fence test case. The fence is modelled in 
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Fig. 14 Flow over a fence test case. Weak scalability measurement of the fluid-acoustic interaction 
simulation for the fence test case 


Ateles using the newly implemented immersed boundary method, enabling high- 
order representation of complex geometries in Ateles [1]. We solve the compressible 
Navier-Stokes equations in the flow domain with a scheme order of 4 and a four step 
mixed implicit-explicit Runge-Kutta time stepping scheme, with a time step size of 
10-7. The total number of elements in the flow domain is 192,000. For the far-field, 
we use the same setup as for the FASTEST-Ateles coupling. 

The linearized Euler equations in the far-field can be solved in a DG setting in 
the modal formulation, which makes the solver very cheap even for very high order. 
In the near-field domain, the non-linear Navier-Stokes equations are solved with a 
more expensive hybrid nodal-modal approach. Due to this, and the different spatial 
discretizations and scheme orders, both domains have different computational load, 
which requires load balancing. We use static load balancing, since neither the mesh 
nor the scheme order vary during runtime. As both solvers are instances of Ateles, 
we apply the SpartA algorithm [21], which allows re-partitioning of the workload 
according to weights per elements, which are computed during runtime. Those 
weights are then used to re-distribute the elements according to the workload among 
available processes (see [17] for more details). The total number of processes used 
for this test case are 14,336 processes, which is equal to one island on the system. As 
mentioned previously, the total workload per subdomain does not change, therefore 
we start our measurements by providing the lower subdomain 100 processes and 
the upper subdomain 12 processes, which is equal to 4 nodes on the system. This 
number per subdomain is then doubled for each run, the ratio is kept the same. 

Figure 15 shows the strong scaling measurements for both coupling approaches 
(APESmate and preCICE) executed on the SuperMUC Phase2 system. As can be 
clearly seen, both coupling setups Ateles-APESmate-Ateles and Ateles-preCICE- 
Ateles scale almost ideally, however with a lower absolute runtime for the APES- 
mate coupling as expected. 
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Fig. 15 Flow over a fence test case. Strong scaling measurement for Ateles-Ateles coupling using 
both APESmate and preCICE—and the ideal linear scaling as reference 


7 Summary and Conclusion 


We have presented a partitioned simulation environment for the massively parallel 
simulation of fluid-structure-acoustic interactions. Our setup uses the flow and 
acoustic solvers in the finite volume software FASTEST, the acoustic solvers in 
the discontinuous Galerkin framework Ateles as well as the black-box fully parallel 
coupling library preCICE. In particular, we could show that with a careful design 
of the coupling tool as well as of solver details, we can achieve a bottleneck-free 
numerically and technically highly scalable solution. It turned out that efficient ini- 
tialization of point-to-point communication relations and mapping matrices between 
the involved participants, sophisticated inter-code load balancing and asynchronous 
communication using message buffering are crucial for large-scale scenarios. With 
these improvements, we advanced the limits of scalability of partitioned multi- 
physics simulations from less than a hundred cores to more than 10,000 cores. 
Beyond that, we reach a problem size that is not required by the given problem 
as well as scalability limits of the solvers. The coupling itself is not the limiting 
factor for the given problem size and degree of parallelism. To be able to use also 
vector architectures in an efficient sustainable way, we adapted our solvers with a 
highly effective code transformation approach. 
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Abstract Plasma fusion is one of the promising candidates for an emission-free 
energy source and is heavily investigated with high-resolution numerical simula- 
tions. Unfortunately, these simulations suffer from the curse of dimensionality due 
to the five-plus-one-dimensional nature of the equations. Hence, we propose a sparse 
grid approach based on the sparse grid combination technique which splits the 
simulation grid into multiple smaller grids of varying resolution. This enables us 
to increase the maximum resolution as well as the parallel efficiency of the current 
solvers. At the same time we introduce fault tolerance within the algorithmic design 
and increase the resilience of the application code. We base our implementation 
on a manager-worker approach which computes multiple solver runs in parallel 
by distributing tasks to different process groups. Our results demonstrate good 
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convergence for linear fusion runs and show high parallel efficiency up to 180k 
cores. In addition, our framework achieves accurate results with low overhead in 
faulty environments. Moreover, for nonlinear fusion runs, we show the effectiveness 
of the combination technique and discuss existing shortcomings that are still under 
investigation. 


1 Introduction 


Scientists widely agree that the human-made climate change due to CO2 emission 
will pose severe challenges for the future. Hence, to reduce the overall emission 
of greenhouse gases, carbon free energy sources will be required. Nuclear fusion 
is one candidate which offers abundant fuel with a large energy output. However, 
there are still many difficulties to overcome in the task to build an energy-positive 
fusion reactor. A major challenge are micro-instabilities caused by turbulent plasma 
flow. These can be studied by numerical simulations to further improve the design of 
fusion reactors. One of the codes dedicated to tackle this problem is GENE, which 
solves the gyrokinetic Vlasov-Maxwell equations. 

Unfortunately, the computational costs of these simulations are enormous. Since 
they involve calculations on a five-dimensional spatial grid, they suffer from the 
curse of dimensionality, i.e. from the exponential dependence of the grid size on 
dimension. In our project, we overcome this issue with the help of the sparse grid 
combination technique. This method splits a simulation into several independent 
runs on coarser anisotropic grids and then aggregates the results on a sparse grid. 
As a consequence, the curse of dimensionality is alleviated, which makes larger 
simulations feasible. In addition, the arising subproblems can be computed in 
parallel which allows for scaling to larger processor numbers. This enables the use 
of future exascale computers. Finally, the method opens up an algorithmic approach 
to fault tolerance, which will be a key requirement for exascale computing. 

In this paper, we report on the main contributions of the two funding periods of 
our project EXAHD within the priority program Software for Exascale Computing 
of the DFG, with an emphasis on the second funding period. Partners in this joint 
project were the Institute for Parallel and Distributed Systems at the University 
of Stuttgart (PI Pfliiger), the Institute for Numerical Simulations at the University 
of Bonn (PI Griebel), the Institute for Informatics at the Technical University of 
Munich (PI Bungartz), the Max-Planck Institute for Plasma Physics (PI Jenko), the 
Supercomputing Center of the Max-Planck Society Garching (PI Dannert, second 
period), and the Center for Mathematics and Its Applications of the Australian 
National University (Hegland, external partner). 

Our main contributions include significant progress for the solution of higher- 
dimensional plasma flow problems. In particular, we addressed several exascale 
challenges: we have realized the first-ever massively parallel computations with the 
sparse grid combination technique, enabling scalability beyond the petascale for 
mesh-based discretizations by numerically decoupling the underlying systems and 
by introducing a novel level of parallelism. We tackled load-balancing on massively 
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parallel systems, learning from gathered runtime data. And we have developed 
new and innovative approaches for fault tolerance on multiple levels, in particular 
algorithm-based fault tolerance, which can be a crucial aspect in settings where 
checkpoint-restart is infeasible due to the massive amount of data that would have 
to be handled. 

In the first funding period we have focused on linear simulations of hot fusion 
plasma which govern the exponential growth phase of turbulent modes. We have 
developed core algorithms for fault tolerance and scalability, including algorithm- 
based fault tolerance with the combination technique and optimal communication 
schemes. 

In the second funding period, we have extended the simulation setting to 
fully nonlinear simulations, which pose several hurdles as they do not match the 
classical setting for the sparse grid combination technique anymore. We have 
further developed a flexible framework for massively parallel simulations with the 
combination technique, including software interfaces to GENE and the general PDE 
framework DUNE. We have realized fault tolerance within both GENE and the 
combination framework, which even allows to detect and mitigate soft faults, for 
example due to silent data corruption. Finally, we have extended load balancing to 
a fully data-driven approach based on machine learning. 

In the following, we first give an introduction to the underlying theory and the 
mathematical model. To this end, we describe the sparse grid combination technique 
and the gyrokinetic approach to plasma physics underlying GENE. Then, we outline 
the principles behind our approach to fault tolerance. The third section focuses on 
the implementation while numerical results are presented in section four. 


2 Theory and Mathematical Model 


2.1 The Sparse Grid Combination Technique 


The sparse grid combination technique [10] is a method for approximating high- 
dimensional problems based on sparse grids [2]. It yields an alternative representa- 
tion of a sparse grid solution, not reliant on hierarchical surpluses but using different 
coarse full grid solutions to build a combination solution. The underlying idea 
of the sparse grid approximation was first introduced by Smolyak for the case of 
quadrature [24] and was since applied to a broad field of applications [7, 21]. 

Let’s assume we want to approximate a given function u. Furthermore let Qn be 
the regular Cartesian grid on [0, 1]“ with mesh size hy = 27” := (27”!,..., 274) 
resulting in 2”i + 1 points along the i-th direction, depending on whether there 
are points on the left and/or right boundary. Arbitrary rectangular domains can be 
treated by scaling them onto the unit hypercube. 

Now we can define a piecewise linear approximation of u on Qn, 


Un = 5 Un,j Pn,j (1) 
j 
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where un,j are the function values at the grid points and ¢y j are piecewise d-linear 
hat functions with support of volume Ty (2 hy, ) anchored at the grid points, which 
we call nodal basis. There are various other types of basis functions one could 
choose in order to increase the order of the approximation. For various examples 
we refer to [2]. In this presentation, we stick to the piecewise linear case for reasons 
of simplicity. 

The function Eq. (1) can also be represented in the so called hierarchical basis, 


tn= >) > oj dey = >> te, (2) 


<n j é<n 


where the hierarchical basis functions ej are those hat functions of level £ with 
odd indices only (for £; > 0). Here, the expression £ < n is to be understood 
componentwise. We call the linear transformation between the two representations 
hierarchization and dehierarchization, respectively. 

Now, if u possesses a certain smoothness property, namely (for our case of 
piecewise d-linear hierarchical basis functions) that its second mixed derivative 
is bounded, then the size of the coefficients ag j decays as ~272l4l1, Hence, they 
are called hierarchical surplusses. This leads to the idea of approximating uy by 
truncating the sum in Eq. (2). Compared to the full Cartesian grid, merely a subset of 
points carries a basis function, which is called a sparse grid. This can be formalized 
as follows: For any multi-index set J that is downward-closed (for any £ € J all 
’ < £ are in the set as well) we set 


un = Do ie 3) 


lel 
with n; = max{@; : £ e J} defining the target level. The classical sparse 
grid found in the literature for an isotropic grid with n = n - 1 is given by 


LT={Le Né : |£|; < n}, i.e. only a standard simplex of levels instead of the full 
hyperrectangle is taken into account. In this case, as shown in [2], the number of 
points is drastically reduced, from N 4 to O(N (log N )4-1), where N = 2” +1. 
Furthermore, for functions from Sobolev spaces with dominating mixed derivatives 
HÈ the asymptotic approximation error with regard to the exact function u is only 
slightly worse, O(N~? (log Ny?!) compared to the usual O(N~?). Generalized 
sparse grids [4] are defined by means of general downward-closed index sets I 
and its associated truncation (3). 

Due to the fact that the hierarchical increments ig can be expressed as differences 
of full grid functions ug, Eq. (3) yields a telescopic sum that evaluates to 


u® = 2a ug ; c = Yee xrlt+z), (4) 


lel z<1 
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with xz being the characteristic function of 7. We call this equivalent representa- 
tion the (generalized) combination technique. Note that most of the combination 
coefficients cg vanish, except the ones whose upper neighbors are not all included 
in J (i.e. the ones near the upper boundary of J). The combination technique is 
advantageous in practical applications, since code that produces full grid solutions 
can be readily used as a black box and different component solutions ug can be 
computed independently from each other, thus introducing a coarse grain layer of 
parallelism. The point set produced by the union of all component grids is a sparse 
grid as before. In practice, if one wants to evaluate the combination solution at a 
point that is not shared by all component grids, the respective component functions 
have to be interpolated according to the basis used. 

The specific combination scheme used throughout this work is the truncated 
planar combination technique, introducing a minimal level £min and a maximal level 
Lmax = n. The formula for the index set reads 


a(l a(d 
T tnin, max = fe = Ng : £ € conv (trin ’ Lmin + a ! EERE, Lmin + rh yt , (5) 


where 2” = (0,..., Limax — Limin, . .  , 0), i. it includes alll level indices that lie 
inside the simplex spanned by £min and its adjacent corners of the hyperrectangle 
defined by min and max. For example, the 2D sparse grid constructed with min = 
(1, 1) and max = (3,3) is shown in Fig. 1, where the simplex is just a shifted 
triangle. 

As aremark, the approximation rate shown above does not, in general, hold in the 
case of solving PDEs with the combination technique, which we are interested in. 
However, it can be shown [3] that the combination technique produces the same rate, 
as long as a certain ANOVA-like error expansion exists. In fact, the combination 
technique is applicable to any quantity (not just functions) that is the solution to 
a problem depending on some discretization parameters, provided there is such an 
error expansion. Alternatively, there is a second route to proof that the sparse grid 
Galerkin approximation and the combination method for a PDE possess errors of 
the same order. To this end, in [8] optimal convergence rates of the combination 
technique for elliptic operators acting on arbitrary Gelfant triples were shown. 


h 
3 + 
i 20,3) 
1 
= T ie + (c) 
: Qinga) 263) 
0123 h ia TE 


Fig. 1 Grids resulting from a 2D combination technique with min = (1, 1) and max = (3, 3). 
The result of the combination is the sparse grid o 1,6,3) 
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Nevertheless, since it may be too hard to proof for a given case that the 
necessary error expansion exists, or that the respective Gelfand triple indeed leads 
to equal approximation rates between the sparse grid Galerkin approach and the 
combination method, this motivates the use of numerical experiments to determine 
the applicability of the combination technique. 

The type of PDE we will be concerned with is a time-dependent initial value 
problem. We will, however, not include the time direction as a dimension of the 
sparse grid approximation, but rather treat it as a parameter. The reason is that the 
solver we will mainly deal with employs an explicit time stepping method, which 
means that anisotropic grids with large time steps may be forbidden by a CFL 
condition. The solver also dynamically adapts the time step size, so that it may 
be disadvantageous to interfere with its capabilities. 

Another difficulty is that the global truncation error introduced by a time stepping 
scheme usually grows exponentially in time and, for a nonlinear PDE, even the exact 
solution may be highly sensitive to perturbations in the initial condition (which are 
always present due to different spatial discretizations). In consequence, we cannot 
expect the component solutions to stay similar for long simulation times and so the 
combination technique would break down. 

Hence, we will look for the solution after some interval AT, starting at time t, 
then repeat the process beginning at time t+ AT and so forth. Using the combination 
technique for the remaining (spatial) dimensions, the solution is constructed from 
the time evolution of the component solutions: 


WO (t+ AT) = Yo ce Tiar (uel(t)] (6) 
lel 


with the time evolution operator 7. The initial values are set by interpolating the 
combination solution from the previous time step onto the different component 
grids. Once in the beginning they are set by discretizing a known initial function 
uo: 


welt) = Tofu] O = Fe lwo) , ) 


where J denotes the interpolation operator onto grid Qe. The interval AT has to be 
chosen sufficiently small, so that the error introduced by the time evolution is small 
compared to the spatial discretization error (which we want to optimize by using the 
combination technique). Sensible values for AT depend on the scenario at hand and 
will be investigated in our experiments. 


2.2 Plasma Physics with GENE 


The most common way to study turbulence on a microscopic scale are the 
gyrokinetic equations. They are a system of partial integro-differential equations for 
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the particle distribution function f in a 5D phase space plus time for every particle 
species s in the plasma (typically hydrogen ions and electrons). The model consists 
of a Vlasov equation (when neglecting collisions) with Lorentz force, 


0 Ze 
ofs tv-Vxfs+— 
ot m 


(E+—xB)- Wf =0, (8) 


coupled with Maxwell’s equations for the electromagnetic fields as well as the self- 
consistent computation of charge and current density from the respective moments 
of the distribution function [6]. 

The gyrokinetic approach uses several approximations tailored to the setting of 
a suspended plasma in a strong toroidal magnetic field, which is the concept behind 
tokamaks and stellarators, the most popular types of fusion reactors. Since charged 
particles move on helical trajectories along magnetic field lines, the motion is split 
into the direction parallel to the field line, vj, and the plane perpendicular to it. The 
phase information of the rapid circular motion in the perpendicular direction is then 
averaged out and its magnitude encoded in the magnetic moment u œ |v; |?. Also, 
the coordinate system is commonly aligned to the magnetic field lines and, lastly, a 
ôf -splitting is employed, dividing the distribution function into a static equilibrium 
distribution and an unknown turbulent part g. 

Eventually one arrives at a 5 + 1 dimensional nonlinear partial differential 
equation for g(x, y, Z, Vi, H; t) of the general form 


- = L8 + N8) (9) 
where £ and N are the linear resp. nonlinear parts of the differential operator. 

One of the most widely used codes to solve the gyrokinetic equations is 
GENE [17]. It is an Eulerian (fixed grid) solver that uses mostly finite differences 
to discretize the spatial domain and an explicit Runge-Kutta scheme of fourth order 
for the time evolution. Even though it is highly optimized and employs domain 
decomposition to scale well on large computing systems, it still requires huge 
amounts of computation time and for some scenarios the desired resolution is still 
out of scope. This is why we deem GENE a good candidate to profit from the 
combination technique. Not only are the component solutions cheaper in terms of 
memory, hence always feasible to compute, but additionally can afford a larger time 
step size (due to the CFL condition), reducing the computational cost further. GENE 
operates in different modes described here for later reference: 


Linear/Nonlinear In the linear mode, N(g) in Eq. (9) is neglected. In this case one 
is interested in the growth rate of the turbulence, i.e. the eigenvalue of £ with the 
largest real part and its corresponding eigenvector. To this end, GENE can either be 
used as a direct eigenvalue or as an initial value solver. The nonlinear mode treats the 
full equation and is used to study the quasi-stationary state after the initial growth 
phase. 
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Local/Nonlocal The local case uses the so called flux tube approximation, simu- 
lating only a small cross section of the full torus. Periodic boundary conditions are 
employed in the x- and y-direction, enabling Fourier transformation. In nonlocal 
runs the whole domain along the x-direction is treated, losing periodicity in x. 


Adiabatic Electrons Setting the number of species to one (only ions), the electrons 
are not treated with the full kinetic model, which reduces the problem size. This 
approximation was mostly used for our results. 


In order to use GENE within our combination technique framework, several 
adjustments to GENE had to be implemented during the project. They are minor 
enough to not undermine the black box functionality. First of all, the grid setup 
of GENE had to be adjusted so that grids using 2 grid points in each respective 
dimension become nested. This is important because, on the one hand, the com- 
bination technique requires nested grids and, on the other hand, powers of two are 
necessary for an efficient parallelization in GENE. As a result, only the left boundary 
may carry a grid point (cf. Sect.2.1), which was originally not the case for every 
direction. Additionally, the j-direction uses a Gauss-Laguerre grid by default for 
efficient quadrature. Here we had to switch to an equidistant grid, losing accuracy, 
but a potential solution to this problem using Clenshaw-Curtis points is discussed 
in [18]. Furthermore, GENE requires an unusual quasi-periodic boundary condition 
in the z-direction, which we had to incorporate into the interface of our framework. 

Finally some performance issues had to be overcome. At every restart, GENE 
performs an initialization routine that produces a large overhead when the simula- 
tion time itself is short. In particular for nonlocal runs, a large gyromatrix has to be 
assembled which is costly. To this end we had to integrate the functionality to store 
and reload this matrix from memory. The same had to be done for the checkpoints 
which could previously only be written to and read from disk [13]. 


2.3 Fault Tolerance 
2.3.1 Fault Tolerant Combination Technique 


The Fault Tolerant Combination Technique (FTCT) is an algorithm-based fault 
tolerant version of the Sparse Grid Combination Technique, which was proposed 
in the project proposal of phase one of EXAHD and first published in [12]. The 
method addresses the problem of both hard faults and soft faults.! Since these 
faults cause a corruption or loss of data on certain processors, it is not guaranteed 
that all combination grids can safely contribute to the combination solution. After 
identifying such a fault we therefore need to construct a new combination scheme 


'Hard faults are detected by the operating system and usually cause the affected system part to fail 
while soft faults remain undetected by the system and usually appear in the form of bitflips during 
computation or processing of data. 
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Fig. 2 Recovering the combination scheme after two failed component grids (middle left). A 
new scheme (middle right) is constructed which excludes the failed resources while utilizing a 
component grid from the next lower subdiagonal. The resulting sparse grid (right) should be only 
slightly less accurate than the original sparse grid (left) 


that excludes faulty parts from the original scheme. Ultimately, this modified 
scheme should only consist of existing component grids, thus eliminating the 
necessity to recompute results. In addition, switching to less and possibly coarser 
component grids should only slightly decrease the accuracy compared to the original 
scheme. In case of a time-dependent problem with frequent recombination, the 
FTCT restores the original combination scheme for all future combinations” and 
continues computation. We will briefly summarize the main steps of the algorithm: 
fault detection and reconstruction of a fault-free combination scheme. 

Fault detection heavily depends on the class of faults that we experience. For 
hard faults or process failures, typically the operating system in combination with 
the MPI runtime can be used to identify all processes which are affected by a fault. 
These faulty ranks are then removed and computation can be continued. In case of 
soft or silent faults it is more challenging to detect a faulty component as the failure 
is undetected by the operating system. Therefore, dedicated routines are used to 
detect such faults, e.g. running a program multiple times or with different algorithms 
or by checking application dependent properties. An example of the latter case 
can be found in Sect. 3.3 where we describe our implementation of the silent fault 
detection. 

Once all faulty component grids are detected, the FTCT constructs a new fault- 
free combination scheme (see Fig. 2). The problem of finding such a new scheme 
is known as the General Coefficient Problem (GCP). For further information about 
the problem and on how to solve it most efficiently we refer to [11, 12]. In order to 
increase convergence speed and the probability of finding such a solution we already 
compute the component solutions of two additional lower diagonals? of the level 
set from the start, alongside the required ones, even though they originally have 
a coefficient of zero. Since the degrees of freedom of component grids decrease 


*In case of process failures either failed resources are replaced by spare ranks or the individual 
tasks are distributed to the remaining fault-free processes. 


3For dimensions d > 2 it is in fact a d — 1 dimensional slice and it might be tilted in general. 
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exponentially with the magnitude of the level index, this adds only minor overhead 
to the overall computation time. Section 3.3 outlines our implementation of the 
FTCT in more detail. 


2.3.2 Fault Recovery Algorithms 


In addition to the FTCT, a fault tolerant version of GENE was developed (FT- 
GENE). The purpose was to allow FI-GENE to tackle small faults, whereas larger 
faults would be handled by the FTCT. To that end, an object oriented Fortran 
2008 library called libSpina was written. The purpose of this library is to provide 
functionalities that aid the implementation of a fault tolerant application. In contrast 
to existing tools such as ULFM [1], libSpina does not replace any existing MPI 
implementation and does not attempt to extend the MPI standard. 

The library is responsible for managing and separating “spare nodes” from 
the application, the management of a channel for broadcasting errors, detection 
of faulty resources and sanitization of the MPI environment. Instead of forcing a 
complete rewriting of the code in order to fit a fault tolerant framework, libSpina 
provides several preprocessor macros for encapsulating tasks such as exception 
handling and timeout-based error detection. Some checkpointing and message 
logging capabilities are available, but mostly for the initialization steps of FT- 
GENE. The use of the library causes a negligible overhead (circa 3%). 

Since libSpina is just a tool designed to assist in programming, it does not 
provide any specific data recovery mechanism. Whenever a fault occurs, the lost data 
must be recovered by checkpoint/rollback strategies or an algorithmic/approximate 
recovery, which is not addressed by the library. 

FT-GENE inherits the exception handling mechanism of libSpina and is fully 
written using non-blocking MPI calls thanks to libSpina’s macros. Additionally, 
FTI-GENE is able to withstand faults and provides two recovery methods: check- 
point/rollback and blank recovery (purely for testing purposes) which are discussed 
later. 


3 Implementation 


3.1 Parallel Implementation of the Combination Technique 


The main parallelization strategy of our combination framework is the manager- 
worker approach. In this strategy, a dedicated process—the manager—distributes 
work packages to workers which then process their tasks. Within the combination 
technique these tasks consist of solving the respective PDE on a specific component 
grid. Since these grids will usually be still too large to be solved on one MPI rank, 
we assign the tasks to a group of workers which we will call a process group. To 
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Fig. 3 The manager-worker implementation in our combination framework [13]. The manager 
assigns tasks to process groups which perform a domain decomposition to work on the tasks in 
parallel 


avoid a communication bottleneck at the manager, only one dedicated rank in each 
process group—the master—communicates with the manager. This master process 
then broadcasts all information from the manager to the remaining processes in the 
process group. In Fig. 3 one can see an example of the task distribution to individual 
process groups. 

The main concern of our framework is to achieve high parallel efficiency for 
potential exascale computing. Due to the independence of the component grids, we 
can solve all tasks independently. Additionally, all process groups compute each 
task in parallel by applying domain decomposition. This results in two levels of 
parallelism: parallelism between process groups and within process groups. 

Unfortunately, in a chaotic time-dependent setting, we cannot completely decou- 
ple all tasks, as the individual solutions would eventually drift too far apart from the 
exact solution resp. from each other, which would deteriorate the result of the final 
combination. To avoid this effect, frequent recombination is applied (see Fig. 4). 
Here, we construct the sparse grid solution after a short simulation interval by 
gathering all component grids, and then redistribute the aggregated information to 
proceed with the computation. This process introduces a global synchronization 
point as well as global communication. To ensure high parallel efficiency, we 


ty ty to ts i t 
Fig. 4 Solving time-dependent PDEs with the combination technique [13]. Frequent recombina- 


tion is applied in between computation phases to avoid that the individual solutions drift too far 
apart from each other 
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Fig. 5 The recombination step in detail [13]. After the hierarchization phase (top left), the sparse 
grid is globally reduced between process groups (right) and then dehierarchized within each 
process group (bottom left) 


therefore require efficient load balancing (see Sect.3.2) as well as an efficient 
recombination mechanism. 

The recombination (Fig.5) is split into three phases: hierarchization, global 
reduction of the sparse grid and dehierarchization. In the hierarchization step, 
each process group transforms the local solutions from its assigned tasks into 
the hierarchical basis. These hierarchized solutions are then added locally to the 
respective subspaces of the corresponding sparse grid. To this end, each process 
holds the part of the sparse grid that corresponds to their geometric subdomain in a 
buffer. 

Thereafter, the global communication phase begins in which the partial sparse 
grid solutions from all process groups are added together with an MPI_Allireduce. 
In order to achieve low communication overhead, all process groups are constructed 
from the same number of ranks and apply the same domain decomposition on the 
sparse grid. A rank therefore directly communicates only with the ranks of other 
process groups that are assigned to the same subdomain. This procedure heavily 
reduces the number of communications and ensures a low communication overhead. 
Once the global communication has finished, the individual process groups extract 
the hierarchical information for their component grids from the sparse grid. The 
procedure is concluded by the dehierarchization step which transforms the data 
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back to the nodal basis. It should be noted that only the second phase requires 
global communication while the hierarchization and dehierarchization only require 
communication within each process group. 


3.2 Load Balancing 


Load balancing for HPC simulations is routinely implemented via different standard 
techniques, such as domain decomposition or resource assignment. With the mas- 
sively parallel distributed combination technique framework, there is an algorithmic 
way of performing load balancing. Due to the manager-worker scheme with process 
groups, cf. Fig.3, the manager is free to assign multiple component grids to 
different process groups (provided there is enough free memory). In particular, this 
assignment can be chosen to balance the runtimes such that process groups do not 
have to wait for each other longer than necessary. An illustration is given in Fig. 6. 

Of course, this assignment works best if accurate estimates of the individual 
grid runtimes can be obtained beforehand; then, we can even statically assign 
the grids. Reaching further, this can be improved using the information obtained 
in the first solver run: by collecting the runtimes of the (presumably) longest- 
running grids, the process groups can be “filled” with work. The initial dynamic 
“work-stealing” approach was presented by Heene [14]. The next step, dynamic 
reassignment according to load imbalances has not been implemented so far, since 
the reassignment of a grid to another process group could become very costly. In 
addition to the field grid data, extensive simulation data is required for GENE, such 
as the gyromatrix; it would have to be transferred or explicitly recomputed on the 
new process group. Now, using the estimates obtained by the model, a decreasingly 
ordered list can be created, and the grids can be assigned to the process group. This 
means that the relative ordering between tasks is more important than the absolute 
accuracy of the model. 
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Fig. 6 Possible parallel computation scheme for different 2D grids, cf. [22]. In principle, the 
hierarchization step for a task could happen immediately after its completion, further improving 
load balancing. This is not included in our implementation to date, but this does not proof to be a 
significant bottleneck 
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The simplest approach would be to just take the number of grid points into 
account [9], assuming that all grids of the same level sum |£]; take the same time 
to run. In larger scenarios, this will lead to variable results, as the assignment will 
be ambiguous. A more expert knowledge based model can be designed by adding 
a dependence on grid anisotropy [14]. In this model the runtime contribution of the 
number of grid points N is estimated with a function r (N), 


r(N) := mN* +b, (10) 


where the coefficients m, k and b are fitted to runtime data of isotropic grids. 
Moreover, the influence of anisotropy is added through an additional term h(Sg), 
where sg; = th: The total runtime estimate t (N, sg) then reads 


d-1 d-1 
t(N, se) = r(N) - h(se) h(se) := c + X cise + > X cijseise j Panis 
i=l i=l j<i 


(1) 


Again, the coefficients c in the polynomial ansatz for h (sẹ) are fitted to runtime data, 
this time including anisotropic grids. 

For nonlinear, global simulations, the load modeling was recently extended 
to fully data-driven techniques, such as support vector regression and neural 
networks [22]. 


3.3 Fault Tolerant Combination Technique 


Fault tolerance is becoming more and more important in exascale frameworks as the 
increasing process number will most certainly also increase the probability for some 
components to fail. In our case we assume that such an error will most likely occur 
during the computation phase, which takes the majority of the overall runtime. The 
computation step will therefore not complete for all affected process groups. In the 
next section we briefly summarize the progress made in [15, 16, 19, 20]. 

Such faulty groups are detected by the manager at the beginning of the global 
communication phase with our fault simulation layer. This simulation layer imitates 
the behavior of ULFM and returns an error signal if a process has failed that should 
interact in a certain communication. It is also possible to simulate the process failure 
of specific ranks at specific simulation points. Once the affected groups are known, 
the manager process calculates a fault-free combination scheme by solving the GCP 
(cf. Sect. 2.3.1). We use GLPK [5] to solve this optimization problem which returns 
the new combination scheme. 

Unfortunately we cannot proceed directly with the communication at this point 
since failed ranks are contained in the MPI communicators. The naive implemen- 
tation would just remove all failed process groups and continue computation on 
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Fig. 7 If a process failure occurs in one of the process groups, the remaining ranks are declared 
as spare processes [19 


PGI PG2 


Fig. 8 Incase all failed ranks can be replaced by spare processes, the process group is restored [19] 


the remaining groups. This procedure, however, might waste valuable resources if 
only few of the ranks in a process group have failed. We therefore save these non- 
faulty ranks and declare them as spare ranks (Fig. 7). For future faults we can now 
substitute failed ranks by spare ranks to restore a process group (Fig. 8). This enables 
us to simulate high failure rates without quickly losing all process groups. 

Once the MPI environment is restored, our implementation proceeds with 
the recombination step by applying the global communication to the fault-free 
combination scheme. For future computations we restore the original combination 
scheme by redistributing failed tasks to the remaining process groups. These tasks 
are then initialized by extraction from the sparse grid that results from the temporary 
fault-free combination scheme. 


4 Numerical Results 


4.1 Convergence 


A very important task in order to confirm the applicability of the sparse grid 
combination technique to GENE is to show that we can produce meaningful results 
while reducing the computational cost. This has to be verified in different ways and 
for different quantities, depending on the setting of the simulation. Therefore, this 
section is split into two parts. 

The first one is concerned with convergence results for local linear GENE runs, 
where the accuracy of computations of the growth rate Amax and its corresponding 
eigenvector g™») is studied. The second part presents results for nonlinear GENE 
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simulations, in particular the most recent investigations of nonlocal runs. Here, 
one is interested in time averages of certain quantities of interest during the quasi- 
stationary phase, mainly the mean heat flux Qes. 


4.1.1 Linear Runs 
Linear simulations only use the linear part of the gyrokinetic equations, 


dg 
2 : 12 
ap (12) 
according to Eq. (9) (the bold face g signifies the discretized version). The dynamics 
of this equation can be understood by considering an eigenvalue decomposition 
(assuming every eigenvalue has multiplicity one) 


a= So oM Hg Lg =ag, (13) 
A€o (L) 


Note that g is complex valued because a Fourier basis is always used in the y- 
direction and, hence, eigenvalues will be complex in general. 
Plugging this decomposition into Eq. (12) yields the solution 


B= Dae Oe" sg”: (14) 
hea (L) 


The exponential growth (or decay) will, for large times, be dominated by the 
eigenvector corresponding to the eigenvalue Amax which has the largest real part. 
This growth rate plus eigenmode are the quantities one is interested in. To this end, 
GENE can be run either as a direct eigenvalue solver or as an initial value solver. For 
the latter case, one initializes a random initial state and simulates long enough, until 
the shape of g(t) stays constant and its ratio at subsequent times settles on eòma ^t, 

First results for the linear case were obtained during the first funding period 
of this project. The combination technique was successfully applied to both the 
eigenvalue solver and initial value runs. In each case, combining the eigenvalues as 
well as the eigenvectors produced by the component solutions yielded satisfactory 
results. We refer to [18] and omit the details here. 

Instead we want to focus on results obtained for local linear initial value runs 
with the massively parallel framework described in Sect. 3.1, using the possibility 
of frequent recombination after short time intervals. The first studies with this 
framework were conducted during the first half of the second funding period and 
are recorded in [13]. They examined the convergence of the eigenvector in a GENE 
test case simulating ion temperature gradient (ITG) driven instabilities. 

The following combination schemes were set up for this experiment, always 
using the z-, vj- and j-directions for combination (the ky- and ky-direction were 
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Fig. 9 Error of max) compared to the reference solution with £ = (3, 1, 8, 8, 8). The combined 
solutions were obtained by combining after each time step with a total of 6000 time steps 


fixed at level 3 and 1, respectively, as not much resolution is needed in these 
dimensions in local linear settings): The terms “combi 4 grids” and “combi 10 
grids” describe schemes where min and fmax are (0, 0, 1, 1, 1) resp. (0, 0, 2, 2, 2) 
apart. This results in the mentioned number of component solutions. Finally, “combi 
Imin” means that fmin = (3, 1,4, 4, 4) is kept fix regardless of maximal level. In 
order to calculate the error of the obtained eigenvectors, a reference solution was 
computed on a high resolution full grid with £ = (3, 1, 8, 8, 8). All other solutions 
were interpolated to this reference grid and normalized to unity (because only the 
shape matters), then the L2-norm of the difference in absolute value was taken as 
the error measure. 

Figure 9 shows results for all combination schemes as well as single full grid 
solutions for comparison. Here, a recombination was applied after each GENE time 
step, which turned out to yield the best overall results. One can see that the error 
decreases with higher target levels and that the combined solutions compare well to 
the full grid convergence. The higher the number of component grids, the worse the 
error as interpreted at the target level, but keep in mind that the number of degrees of 
freedom is more and more reduced. All combination schemes also beat the best error 
achieved with one of their component grids, which is a crucial test, since otherwise 
one could be content with that one component solution. 

We also have to show that the computational cost is reduced. To this end, different 
combination intervals (in number of time steps) have been investigated as shown in 
Fig. 10. It turns out that the overhead for recombination after each time step was too 
large. There are two main reasons. Firstly, this test case was comparatively small 
so that the relative overhead is more significant than for much larger problem sizes. 
Secondly, the restart behavior of GENE (as discussed in Sect. 2.2) is still not optimal 
so that the pure GENE runtime is much larger for frequent recombination. However, 
the figure also shows that there is a middle ground. For example, the run time of 
“combi Imin” with n = (3, 1, 8, 8,8) at a combination interval of 10 is already 
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Fig. 10 Left: Computation times for “combi Imin” with n = (3, 1, 8, 8, 8) at different combina- 
tion intervals and for the reference full grid solution. Four process groups with 16 processes each 
were used. Right: Relative increase of the error for g®ma) for different combination intervals 


lower by a factor 5 than that of the reference solution. At the same time, the error is 
only increased by about 25% compared to the optimum. 

In conclusion, the applicability of the combination technique with the massively 
parallel framework has been demonstrated for a small test case. The benefit is 
expected to be even bigger for larger problem sizes. 


4.1.2 Nonlinear Runs 


In nonlinear simulations the initial exponential growth phase is halted after some 
time by the increasingly dominant influence of the nonlinear part of the PDE. 
Henceforth, the distribution function is subjected to chaotic dynamics. This means 
that slight disturbances will completely alter the trajectory in the long run (butterfly 
effect) and, thus, the distribution function itself is not a suitable observable anymore. 
Still, distinctive patterns in a typical trajectory can be observed that result in a 
quasi-stationary state. Common observables (often averages or moments of the 
distribution function) will statistically fluctuate around relatively robust mean values 
that only depend on the problem parameters. Hence, the goal of the combination 
technique should be to accurately reproduce said quantities of interest with a similar 
statistical uncertainty. Throughout this chapter, an ITG test case with adiabatic 
electrons is studied. We switched to nonlocal simulations to increase the problem 
size in line with the conclusion of the previous section. 

In order to familiarize ourselves with the behavior of important quantities of 
interest, we initially tested directly combining them instead of g. That is, we choose 
a combination scheme and, once all component runs are finished, we calculate 
the time averages and linearly combine them with the appropriate combination 
coefficients. The method we used to determine when the quasi-stationary phase 
begins and to estimate the statistical uncertainty follows [25]. 

We focused on the mean heat flux Qes because it is one of the most statistically 
robust quantities. Results for various combination schemes in different dimensions 
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Table 1 Results for direct combination of Qes 


max» Lmin Reference Direct combination | Best component 

(10,5,6,5,4), (7,5,3,5,4) 21.7+0.4 23.1 + 0.9 (6% 0.66) 23.7 + 0.4 (9% 0.13) 
(9,5,5,6,5), (6,5,3,4,3) 20.2 + 1.0 28.2 + 8.1 (40% 0.33) 29.9 + 0.5 (48% 0.04) 

(11,5,5,7,4), (7,5,4,4,3) | 21.8 + 0.5 21.6 + 1.4 (1% 0.37) 22.8 + 0.4 (5% 0.04) 


The two values in parentheses denote the relative error and the fraction of work load (in core-hours) 
in relation to the reference solution 


are summarized in Table 1. The reference values in each case were computed 
with a run at target resolution. Unfortunately, adding and subtracting statistically 
independent quantities causes the variance of the resulting quantity to be the sum 
of variances of the summands scaled by the square of their combination coefficient. 
Since we take the standard deviation as a measure of uncertainty, we get 0 = 
Q; ar) '/2 This is why the uncertainty for the second scheme became very high. 
Furthermore, the combined values are often not significantly better than the best 
component solution. 

The problem is that finding an efficient level set is a balancing act. On the one 
hand, the minimal level is restricted by GENE simulations becoming erroneous for 
too low resolutions when physically relevant scales are no longer resolved. On the 
other hand, the maximal level is bounded by running into the statistical uncertainty 
so that an increase in accuracy is wasted. This could be addressed by including the 
simulation length as another dimension in the combination technique but decreasing 
the uncertainty is always expensive since it scales only with the inverse square root 
of the simulation length. On top of all this, there is a large discrepancy in the 
influence of different dimensions, in particular, the error is mostly dominated by 
the resolution in x-direction. The third scheme was chosen in a fashion to address 
these issues and it shows the best agreement with the reference. We currently also 
work on optimizing index sets with a dimensionally adaptive algorithm [4] to further 
mitigate said issues. 

Recently, we obtained first results with the massively parallel framework applied 
to the nonlinear test case. In this context we turned back to frequently recombining 
g even in the nonlinear regime. Much care has to be taken since the trajectories 
on different component grids have the tendency to drift apart fairly quickly. Thus, 
the combination interval should be chosen sufficiently small. However, in contrast 
to the linear case, we found that recombining after each or just a few time steps 
leads to the simulation becoming unstable and the distribution function growing 
uncontrollably. We had to increase the combination interval until stable trajectories 
were achieved. Also, depending on which and how many dimensions were included 
in the combination scheme, simulations could become unstable. The causes of these 
effects are not yet understood and currently under investigation. We suspect that the 
perturbation introduced by distributing the combined checkpoint to the component 
grids is large enough so that each trajectory needs a certain relaxation time to reach 
the quasi-stationary state again. 
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Table 2 Results for frequent recombination of g 


max, Lmin | Reference | Recombination of 8 | Direct combination 
(10,5,5,4,3), (8,5,3,4,3)  [27.3+0.6 [28.4 £0.5 (4%) E 274 + 1.7 (0.3%) 
(11,5,5,4,3), (9,5,3,4,3) | 28.1+1.1 | 24.9 + 0.5 (11%) | 27.9 + 2.2 (0.7%) 
(10,5,3,6,3), (8,5,3,4,3) [27.3 + 1.6 | 27.9 + 0.9 (2%) | 28.4 + 1.4 (4.0%) 


The values shown are the mean heat flux Qes with relative error to the reference in parentheses. 
The alleged outlier is actually closer to the high resolution reference from Table | 
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Fig. 11 Time traces of the mean heat flux Qes for the five component grids of the first scheme in 
Table 2. One sees the effect of the recombination of g after every two units of time 


Nevertheless, results for three 2D combination schemes with a recombination 
interval of two units of simulation time are presented in Table 2. The trajectories 
of Qes for the different component grids are shown in Fig. 11. They look promising 
since one can observe that after a combination step the values “snap” to a similar 
value. The remaining discrepancy is solely caused by the influence of different 
resolutions during the calculation of Qes. There is no functionality yet to directly 
compute this quantity on the underlying sparse grid because its implementation 
would have been too time-consuming. Instead, the values at the combination times 
were gathered and again linearly combined according to the combination scheme. 
With this procedure we achieved results that are in accordance to their reference 
solution. Still, the same considerations as for the one-time combination apply here 
and we look forward to expanding the tests to more optimized index sets in the 
future. 


4.2 Scaling Analysis 


Scalability is a crucial aspect for an exascale-ready framework. For our framework 
this boils down to two aspects: achieving a good load balance to avoid idling process 
groups and keeping the overhead of recombination low as the main computation 
happens in the black box solver. We expect that the black box solver itself already 
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provides good scalability up to a certain number of cores. By parallelizing over 
many tasks we can then boost this scalability to reach process counts beyond the 
capabilities of the solver. This is based on the fact that the number of cores can be 
scaled in two ways: increasing the number of cores in the process groups—scaling 
the solver—or increasing the number of process groups while keeping their core 
numbers equal. 

We will first look at the scalability of the recombination steps [13] for a linear 
test case with max = (14,6, 6,8, 7) and fmin = (9,4,4, 6,4) on Hazel Hen 
(HLRS). In Fig. 12 we can see that the hierarchization scales almost perfectly with 
the number of cores as it does not involve computation between process groups. 
Moreover, it mainly involves local computation with little communication in the 
group. Dehierarchization performs almost identically as it is the inverse function of 
the hierarchization. The local reduction step only adds the hierarchical coefficients 
of the local component grids to the sparse grid. As we enforce the same domain 
decomposition of the sparse grids and the component grids, this introduces no 
communication. Consequently, this operation scales perfectly. The global reduction 
of the sparse grid, however, involves the communication of the decomposed sparse 
grid parts over the process groups. Therefore, it shows no scaling behavior but also 
a comparably low runtime. 

If we now sum up all components of our recombination step (bottom graph 
in Fig. 12), we can see that we can scale up to the whole size of the Hazel Hen 
supercomputer if the number of process groups is adjusted well. As a consequence, 
the framework introduces only a low overhead compared to the computation time 
of GENE itself. 

This observation is confirmed by the measurements taken for nonlinear runs 
on Hazel Hen, shown in Fig. 13. This scenario consisted of 16 grids at relatively 
high resolutions. The optimal linear scaling is reached for moderate numbers of 
processes, yet the total scaling capabilities are limited by the solver’s scaling 
capabilities. We see again that the sparse grid combination technique can play out 
its advantages only if the scenario is chosen to contain many grids, i.e., to have 
significant spans between fmin and €max. Even then, for our use cases, the overhead 
imposed by the combination routines is negligible compared to the total runtime. 


4.2.1 Load Balancing 


We will see here that, using the algorithmic opportunity for load balancing by grid 
assignment based on a good load model, cf. Sect. 3.2, we can achieve good parallel 
efficiency for large simulations—applied to both linear and nonlinear GENE runs. 
The linear and anisotropy-based models were tested on 32 cores [14]. For this 
linear experiment, the scenario was constructed from €min = (3, 1,3,3,3) to 
max = (11,1, 11, 11, 11), containing 425 grids. Figure 14 shows that the initial 
dynamic approach (“work stealing”) is preferable to the static assignment, because 
high parallel efficiencies can be sustained as the number of process groups grows. 
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Fig. 12 Timings for the individual steps of the distributed recombination step for different sizes 
of the process groups. The bottom plot shows the total time of hierarchization, local reduction and 
global reduction. We also included a rough estimate of the computation time for one time step of 
GENE with process groups of size 4096. Graphs from [13] 


We can also conclude that taking the anisotropy into account leads to, on average, 
substantial improvements with respect to the parallel efficiencies. 

In order to cope with the larger-scale nonlinear GENE simulations, a similar 
methodology was used to evaluate data-driven techniques. The methods compared 
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Fig. 13 Timings for a combination scheme computed with the nonlinear GENE functionality: ITG 
with adiabatic electrons, for min = (7, 1, 5, 6, 4) to max = (10, 1,7,8,6), averaged over ten 
combination time steps on Hazel Hen 
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Fig. 14 Parallel efficiency for the linear ITG scenario: linear vs. anisotropy-based, static vs. linear 


were, again, the anisotropy-based model, nearest neighbor estimates, support vector 
regression (SVR) and neural networks [22]. These models were trained on 2048 
randomly sampled tasks. As a reference, the best attainable balance that could be 
produced by the dynamic filling heuristic (cf. Sect. 3.2)—through an estimate that is 
the same as the true runtime—is included in the comparison. This time, the scenario 
contained grids from min = (7, 4,3,4,3) to max = (12, 8, 6, 8, 6) to account 
for the anisotropy in the requirements for different solutions, comprising 237 grids. 
Now, the process group size was not fixed to 32 any more, but could reach up to 215 
or 32,768 processes. 

It is to be noted that Fig. 15 shows regions of high parallel efficiency compared 
to Fig. 14. Still, the differences between the different modeling methods are 
significant: while the nearest neighbor and SVR models lead to a relatively quick 
degradation of parallel efficiency, the expert knowledge-based anisotropy model 
performs a lot better, and the neural network can even achieve near-optimal scaling 
(the improvements over the optimal estimate are of course lucky guesses). This 
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Fig. 15 Parallel efficiencies for the nonlinear, larger ITG scenario: by method and process group 
size (upper), and averaged over the different process group sizes for clarity (lower). The legend of 
the lower plot applies to the upper one in the same way 


advantage is enabled by having enough data samples available. Since GENE is 
routinely run on various HPC machines, strong load models could be generated 
even across machines from the runtime data through neural networks. 


4.3 Fault Tolerance 
4.3.1 Fault Tolerant Combination Technique 


This chapter briefly summarizes the most recent results with hard faults within the 
FTCT from [19]. For further results on hard and soft faults see [15, 16, 20]. 

To simulate statistical effects on the accuracy with faults in the FTCT, we conduct 
a random sampling for generating process failures during the simulation. For this 
sampling we choose the commonly used Weibull distribution [23]: 


k-1 
f(a,’ = (5) e/a (15) 


where k and à are the shape and scale parameter. à is used to directly control the 
failure rate where smaller à values cause larger error rates. This distribution is used 
to draw a failure time after which an MPI rank will fail. E.g., a failure time of 10s 
means that after a wall clock time of 10s a process will fail. 
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Fig. 16 Left: Statistical results for the L2 error of the FTCT with different numbers of faults [19]. 
The errors stay relatively small, even in the presence of around 100 faults the error increases by 
less than a factor of two. Right: Average runtimes for our scaling run with the FTCT [19]. One can 
see that the overhead of the fault recovery is negligible compared to the runtime of the solver 


We show results for one linear test case with a 3D combination in z,y and vy 
with max = (8, 8,8) and min = (5,5,5). The remaining dimensions x an y 
are fixed at 9 and | grid points, respectively. Statistical results for 100 individual 
runs with 512 MPI ranks are shown in Fig. 16 (left). It can be seen that for lower 
failure rates the error is almost identical to the base line error without any process 
failure (dotted line). In case of larger failure rates the accuracy decreases but the 
overall error increase is still tolerable compared to the number of failing ranks. On 
average, the error increases by 0.09%, 4% and 20%, respectively, for lambda values 
of 107, 10° and 10°. 

Apart from error analysis, we are interested in the scaling behavior of the FTCT 
in the presence of faults. For this analysis we again choose a 3D combination in 
zu and vy with €max = (14, 14, 14) and €min = (4, 4, 4). In this case we simulate 
only 300 time steps and apply 3 recombinations, i.e. every 100 steps, and inject a 
single fault in one of the process groups. The process group size stays fixed at 1024 
cores and only the number of process groups is varied in the scaling experiment 
(Fig. 16, right). 

In general we observed a very good scaling behavior even in the presence of 
faults. Of course, the runtimes are slightly increased due to the fault of one process 
group, which reduces the overall process count after the fault occurs. However, the 
scaling properties seem to be unaffected by the fault recovery. This can also be seen 
in the low overhead of the recovery step which is about two orders of magnitude 
lower than the time for solving the PDE. Also the time for the recombination scales 
well even though it involves global communication. 

All in all, our results with linear GENE simulations have shown that we can 
achieve fault tolerance with small computational overhead while keeping a similar 
accuracy. For further details about the experiments and results we refer to [19]. 
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4.3.2 libSpina 


We answer here two questions concerning FT-GENE: how its performance com- 
pares with the performance of standard GENE and what kind of faults the 
framework is able to recognize and tackle. The clusters Hazel Hen, Cobra and Draco 
(MPCDF) were used in the tests which follow. 


Robustness These tests used realistic parameters (based on the benchmark 
ITGTEM) where a random subset of the MPI ranks is selected to fail. Several 
combinations of faults were tested (full node failure, partial node failure, multiple 
failures in different time steps, etc.) as well as different causes for failure (e.g. 
“stop” intrinsic, or externally calling the “pkill” command). 

The library libSpina was able to detect the faults, notify FT-GENE, restore the 
MPI environment and allow FT-GENE to read the last checkpoint and rollback 
the computation. Not all fault types could be handled by libSpina, most notably 
faults caused by interrupting the interconnect hardware. Otherwise, the results were 
consistent and reproducible. 


Performance and Overhead The aim of these tests was to determine the perfor- 
mance impact of libSpina on an application, by running both FT-GENE and GENE 
“back-to-back” on the same hardware with the same parameter files. 

In the following, the grid dimensions are displayed as (x, y, z, vij, u). The same 
notation is used for the MPI partitioning of the domain (that is, n_proc_x, n_proc_y, 
and so on). The runtime is shown in seconds and disregards the initialization of 
GENE and libSpina. The reason is that the initialization of GENE involves several 
performance optimization tests (often lasting several minutes) which would mask 
the true difference in runtime. 


The first part of the results in Table 3 has been repeated many times for many 
different configurations: nonlocal and local, linear and nonlinear, between | and 64 


Table 3 Cobra performance Nodes | u |n procu |GENE | FT-GENE | Loss 
comparison 
2 2 1 379.2s | 386.1s 1.83% 
4 4 2 398.9s | 400.7s 0.46% 
8 |8 4 419.6s |422.4s 0.67% 
16 |16 8 414.5s |416.2s | 0.40% 
32 32 | 16 424.45 | 4324s 1.89% 
64 64 | 32 438.0s | 4514s 3.06% 
150 5 [5 568.4s | 565.3s —0.55% 
300 10 | 10 590.4s | 585.0s —0.91% 
600 20 | 20 714.6s | 700.4s —1.99% 


Topmost: (512,64,40,*,48) points, (8,1,5,*,1) paralleliza- 
tion, nonlocal, kinetic electrons (2 species). Lowermost: 
(512,512,20,*,60) grid size, (1,2,20,*,15) parallelization, 
local, kinetic electrons. Both are nonlinear runs with 100 
time steps and 40 ranks per node 
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Table 4 Hazel Hen . Nodes | x n_proc_x |GENE | FT-GENE | Loss 
performance comparison 
1024 576 | 24 517s | 446s —13.7% 
1024 576 | 24 469s |477s 1.7% 
2048 1152 | 48 1156s | 805s —30.4% 
2048 1152 | 48 836s |914s 9.3% 


Grid size: (*,128,32,64,16) points, parallelization: 
(*,1,8,8,16), nonlocal, adiabatic electrons (l1 species), 
nonlinear run, 1000 time steps, 24 ranks per node 


nodes, in Cobra and Draco with several different types of parallelization. On all 
occasions the result was consistent and the overhead of libSpina lay between —3% 
and +3% which is within statistical fluctuations in this case. For the sake of brevity, 
only the most recent of these results are shown in Table 3. 

There was no deterioration of the performance of FT-GENE for large numbers 
of nodes, as it can be seen in the lowermost part of Table 3. 

Similar experiments were performed on Hazel Hen but provide largely incon- 
clusive results. Small tests (with four nodes) were analogous to the results shown 
in Table 3, but tests involving a large number of resources were inconclusive 
(see Table 4). The overhead fluctuated between —30% and +15%. One of the 
possible explanations is that the job was distributed amongst several islands and 
MPI communications had to be passed through switches. Since switches are shared 
amongst all running jobs, the communication patterns of GENE and FT-GENE 
were disturbed by external factors, causing statistically invalid measurements of 
performance. 


5 Conclusion 


During the EXAHD project a massively parallel software framework for the 
sparse grid combination technique was developed and successfully applied to the 
gyrokinetic solver GENE, to linear as well as nonlinear, nonlocal settings. While its 
functionality has been demonstrated for medium sized test cases we expect an even 
greater benefit when applying it to larger simulations in future experiments. 

Our implementation shows excellent scaling behavior up to 180k cores, and 
allows to scale the application code beyond its specific capabilities, thus making 
it ready for exascale computing. At the same time the overhead incurred to the total 
runtime is negligible. 

The load balancing that can be realized with our approach allows to maintain 
parallel efficiencies close to the optimum. This is achieved by incorporating effects 
of anisotropy in our cost model and applying dynamic task scheduling. By modeling 
the solver loads with neural networks, this scheduling can be further enhanced. 
As an outlook, the additional level of parallelism offered by the combination 
technique may allow for scaling beyond a single system: the computation may be 
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decoupled across compute centers. This would require more advanced distributed 
communication, as the combined solutions would have to be updated on both 
systems. First tests have been performed for this setup, but the overall performance 
and gains of such an approach still need to be evaluated. 

We have also shown that the fault tolerant combination technique can be used 
to construct a resilient framework, which can tolerate hard and even soft faults. 
This algorithm-based fault tolerance introduces only minor effects on the overall 
accuracy and runtime while preserving the scaling properties of the framework. 
Furthermore, FT-GENE demonstrated a reasonable ability to tolerate faults, even 
under the restrictions of relying purely on standard MPI, causing no performance 
overhead. 

Finally, by implementing specialized interfaces, the software framework can 
readily be extended to other codes dealing with high-dimensional functions and we 
look forward to seeing it used in many more applications. 
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Abstract Simulations of cosmic structure formation address multi-scale, multi- 
physics problems of vast proportions. These calculations are presently at the 
forefront of today’s use of supercomputers, and are important scientific drivers 
for the future use of exaflop computing platforms. However, continued success 
in this field requires the development of new numerical methods that excel in 
accuracy, robustness, parallel scalability, and physical fidelity to the processes 
relevant in galaxy and star formation. In the EXAMAG project, we have worked 
on improving and applying the astrophysical moving-mesh code AREPO with 
the goal to extend its range of applicability. We have also worked on developing 
new, powerful high-order discontinuous Galerkin schemes for astrophysics, on 
more efficient solvers for gravity, and on improvements of the accuracy of the 
treatment of ideal magnetohydrodynamics. In this context, we have also studied 
the applied mathematics required for higher-order discretization on dynamically 
moving meshes, thereby providing the foundations for much more efficient and 
accurate methods than are presently in use. Finally, we have worked towards 
publicly releasing two major community codes, AREPO and GADGET-4, which 
represent the state-of-the-art in the field. 
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1 Introduction 


Hydrodynamical simulations of galaxy formation have significantly matured over 
recent years and now enable successful predictions of the build-up of the galaxy 
population starting from cosmological initial conditions left behind by the Big Bang. 
These simulations can track the non-linearly coupled evolution of both baryons 
and dark matter, in principle fully accounting for their mutual influence on each 
other and yielding rich predictions for galaxy properties, the diffuse gas in the 
circumgalactic and intergalactic media, and cosmic dark matter clustering. Provided 
that such hydrodynamic simulations can be pushed to sufficiently large volumes, 
they provide the most powerful approach for forecasting non-linear cosmological 
observables related to clustering in different regimes and at different epochs. 
However, it is extremely challenging to include all the relevant physics, and to 
make the simulations accurate, fast, and scalable enough to be able to exploit the 
full capacity of today’s supercomputers. The EXAMAG project has been aiming to 
advance novel numerical methodologies for astrophysical simulations, and to apply 
them right away to timely research questions in astrophysics. A particular focus has 
been on magnetic field predictions, and the development of higher-order methods. 

In this project review, we report a subset of the results obtained within the 
EXAMAG project. In Sect. 2 we describe the IlustrisTNG simulations, the currently 
most advanced set of magnetohydrodynamical simulations of galaxy formation, 
as well as the Auriga simulations, which focus on predictions for our own Milky 
Way galaxy, in particular on the structure and origin of its magnetic field. In 
Sect. 3, we turn to our recent developments of discontinuous Galerkin hydro- and 
magnetohydrodynamics codes. We also give a short description of some of our 
methodological advances in combining the idea of fully dynamic, unstructured 
meshes with high-order discontinuous Galerkin approaches to hydrodynamics. We 
also report on the application of these higher order methodologies to the problem 
of driven isothermal turbulence, where these methods prove to be particularly 
powerful. We then recount in Sect.4 some of our work on the performance and 
accuracy of the two cosmological hydrodynamical simulation codes GADGET-4 
and AREPO, both of which we have prepared for public release to the community 
as part of this project. Finally, we summarize and give an outlook in Sect. 5. 


2 The IllustrisTNG and Auriga Simulations 


The AREPO code [31] introduced a different approach from the ones so far 
commonly adopted in astrophysics to evolve gas on a computer (smoothed particle 
hydrodynamics, SPH, and Eulerian mesh-based methods, typically utilizing adap- 
tive mesh refinement, AMR). It employs a moving, unstructured mesh where, like 
AMR, the volume of space is discretized into many individual cells, but similar to 
SPH, these cells move with time, adapting to the flow of gas in their vicinity. As a 
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result, the mesh itself, constructed through a Voronoi tessellation of space, has no 
preferred directions or regular grid-like structure, and is highly spatially adaptive, 
making it ideal, in particular, for studying galaxy formation. 

Our ground-breaking “Illustris” hydrodynamical calculation of galaxy formation 
[35] demonstrated the utility of the approach for simulations of structure formation. 
For the first time ever, it reproduced the observed morphological mix of galaxies 
and its dependence on stellar mass. Over the past years we have undertaken 
significant efforts to improve the underlying physics models (especially with respect 
to magnetic fields, and the processes regulating star formation through energetic 
feedback from supernovae and black holes), and the accuracy and scalability of 
the numerical algorithms (for example by developing a hierarchical local time- 
stepping algorithm for gravity). These efforts culminated in the simulation projects 
IllustrisTNG and Auriga. 

The Next Generation (TNG) Illustris project! built on the technical and 
scientific achievements of its predecessor and pushed this line of research further. In 
particular, it has improved upon Illustris by including our newly developed accurate 
solver for ideal magnetohydrodynamics [22, 24], and by extending the dynamic 
range and resolution of the simulated galaxies and haloes significantly through an 
ambitious suite of simulations carried out on the Hazel-Hen supercomputer at the 
High-Performance Computing Center Stuttgart (HLRS) with the help of two large 
compute-time grants by the Gauss Centre for Supercomputing (GCS). We have 
considered three different box sizes, roughly 300, 100, and 50 Mpc on a side, and 
computed extensive resolution studies for each of them, yielding three series or runs, 
entitled TNG300, TNG100, and TNG50. The calculations used up to 24,000 cores, 
required 100 TB RAM, and produced 660 TB of scientific data. A visual impression 
of one of these simulation is given in Fig. 1. As an illustrative result, we show the 
clustering statistics of different matter components at the present epoch in TNG300 
in Fig. 2 [33]. The calculation is able to probe deeply into the non-linear regime, 
over a very large dynamic range, thereby allowing predictions both for the internal 
structure of galaxies as well as their large-scale clustering patterns. 

The scientific analysis of the IlustrisTNG simulations has started in 2018 and has 
already led to many important results for a vast range of scientific questions [e.g. 19, 
20, 27, 33, 34], demonstrating the great utility of these methods. At the time of this 
writing, the TNG simulations have already produced 67 journal publications, and 
about 80 additional papers are currently in preparation by a network of international 
collaborators. In December 2018, we publicly released the data of TNG100 and 
TNG300 [21], with TNG50 to follow in a year’s time, which will further amplify 
the scientific use of the simulations. 

The Auriga simulations [11] use a very similar physics and numerical model as 
IllustrisTNG, but “zoom-in” on individual Milky Way-sized galaxies that are studied 
with much higher resolution. A particular focus here has been on understanding the 
origin of the magnetic fields in galaxies, and on predicting their present structure as 
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os 


stellar density 


Fig. 1 Thin projections through the TNG100 simulation, showing (from top to bottom) the gas 
density field, the metallicity, the magnetic field strength, the dark matter density, and the stellar 
density 
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Fig. 2 The matter autocorrelation function for different mass components in our high-resolution 
TNG300 simulation at redshift z = 0, see [33]. We show results for stellar matter, gas, dark matter, 
black holes, and all the matter, as labelled. The linear theory correlation function is shown in grey 
for comparison 


well as their build-up over cosmic time. In [25] we have shown that the magnetic 
fields grow exponentially at early times owing to a small-scale dynamo with an 
e-folding time of roughly 100 Myr in the centre of haloes until saturation occurs 
around redshift z = 2-3, when the magnetic energy density reaches about 10% 
of the turbulent energy density with a typical strength of 10-50 uG. Outside the 
galactic centres, differential rotation in the discs leads to linear amplification of 
the magnetic fields that typically saturates around z = 0.5-0. The final radial and 
vertical variations of the magnetic field strength can be well described by two joint 
exponential profiles, and are in good agreement with observational constraints. 

We have extended the observational comparisons by computing synthetic Fara- 
day rotation maps due to the magnetic fields [26], for different observer positions 
within and outside the simulated galaxies. We find that the strength of the Faraday 
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Fig. 3 Faraday rotation maps as seen by an external observer for three of the Auriga simulations. 
The top and bottom panels show face-on and edge-on Faraday rotation maps due to the magnetic 
fields in three different simulated disk galaxies Au-23, Au-24, and Au-27 [26]. These predictions 
compare well to observations of real galaxies such as M51 


rotation of our simulated galaxies for a hypothetic observer at the solar circle is 
broadly consistent with the Faraday rotation seen for the Milky Way. The same holds 
for an observer outside the galaxy and the observed signal of the nearby spiral galaxy 
M51, see Fig. 3. However, we also find that the structure of the synthetic all-sky 
Faraday rotation maps vary strongly with azimuthal position along the solar circle. 
This represents a severe obstacle for attempts to reconstruct the global magnetic field 
of the Milky Way from Faraday rotation maps alone without including additional 
observables. 


3 Discontinuous Galerkin Hydrodynamics for Astrophysical 
Applications 


The AREPO code is based on a second order finite volume method on a moving 
mesh. An important part of the EXAMAG project was to develop a higher order 
method for AREPO, for which we have chosen to use a discontinuous Galerkin 
(DG) method due to the significant promise this class of methods holds for high 
performance computing. Here the hydrodynamic or magnetohydrodynamic partial 
differential equations are written in a weak form to construct a finite element 
method. On each cell, the solution and the test functions are approximated by 
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appropriate polynomials. Note that the approximate solution is not required to be 
continuous across cell boundaries. The jumps between neighboring cells are taken 
into account by a numerical flux function based on an approximate Riemann solver 
which provides stability to the method. The time discretisation is performed using a 
Runge-Kutta method. For more details, see [29]. 

Using the discontinuous Galerkin (DG) method brings with it the following 
advantages: 


e DG works naturally on unstructured meshes which makes it suitable for adaptive 
meshes with hanging nodes and for moving mesh methods. 

e DG can be made to work for any order of accuracy with a compact stencil and 
provides spectral-type accuracy. 

e DG is extremely local in data communication, thus ideally suited for efficient 
parallelization on current computer hardware. 


Due to these advantages and also the ability of DG to compute convection 
dominated problems in a stable and accurate manner, the EXAMAG project 
worked towards adopting DG methods for astrophysical applications. Our efforts 
in this direction followed two main paths that will be presented in the following 
subsections. First, in Sect.3.1, we describe the discontinuous Galerkin method 
on a Cartesian mesh with automatic mesh refinement for both the Euler and 
magnetohydrodynamics equations. Then, in Sect. 3.2, the discontinuous Galerkin 
method on a moving mesh is discussed. Finally, in Sect. 3.3 we present some results 
on turbulence simulations with higher order numerics. 


3.1 The Discontinuous Galerkin Method on a Cartesian Mesh 
with Automatic Mesh Refinement 


In a collaboration with the EXA-DUNE project (Peter Bastian, Heidelberg) we 
implemented a two-dimensional hydrodynamics code in the DUNE framework 
with total variation bounded and positivity preserving limiters. A mesh refinement 
triggered by the limiting criteria was also built in, see [9]. 

In [10], we developed a code based on the DG method for compressible flows to 
incorporate and test shock indicators that can determine which cells need limiting. 
We showed that the choice of variables that are limited can have a major influence 
on accuracy; limiting the characteristic variables was compared to limiting the 
conserved variables, with the former being better able to control oscillations. These 
limiters were then combined with a shock indicator, yielding the ability to solve 
complex flow problems in a more efficient and accurate manner since the costly 
limiters need not be applied everywhere. 

These investigations provided the foundations for our implementation of the 
discontinuous Galerkin approach in a new branch of the AREPO code called 
TENET, as presented in [28]. This version of the code supports adaptive mesh 
refinement (AMR) and is able to maintain high-order accuracy at AMR refinement 
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Fig. 4 A Kelvin-Helmholtz simulation with fourth order DG and adaptive mesh refinement (see 
[28]). The simulation starts with 64? cells, and refines in selected regions to an effective resolution 
of 40967. As can be seen in the bottom panel (white markers indicate regions that are enlarged in 
subsequent panels), the solution within every cell contains rich information, consisting of a third 
order polynomial 


boundaries, unlike finite-volume approaches that typically fall-back to first order 
there. As an illustrative example, Fig. 4 depicts a simulation of a Kelvin-Helmholtz 
instability using TENET. In [28] we have also shown that DG has significant 
advantages over a finite volume method. Given the same accuracy target, a higher 
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order DG method requires fewer grid points than a finite volume method, allowing 
for much faster run times, especially for smooth solutions. The higher order DG 
method of TENET is also better in computing discontinuous solutions, although 
the numerical techniques for identifying regions where a limiter has to be applied 
are intricate and still not fully mature. We also showed that the DG approach 
automatically conserves angular momentum in smooth regions which is beneficial 
for many astrophysical problems involving rotating objects. 

We next worked towards adding magnetic fields, leading to the model of ideal 
magnetohydrodynamics in fully compressible flows. The system of ideal magneto- 
hydrodynamical equations poses additional challenges for numerical simulations, 
mainly due to the need to preserve the divergence-free condition on the magnetic 
field, which requires specialized techniques. In [5] we developed an entropy stable 
finite volume scheme based on a symmetrized version of the MHD equations. A 
numerical flux is given which allows for the construction of an entropy conservative 
and entropy stable scheme. It is demonstrated how this new scheme is robust for 
MHD simulations due to its entropy framework, in spite of the divergence condition 
not being explicitly satisfied. In [6] this approach was extended to an explicit 
high order Runge-Kutta discontinuous Galerkin method. This methodology is then 
combined with techniques used to control oscillations near discontinuities, similar 
to [10], where these techniques were introduced for the hydrodynamical case. We 
assessed a different approach for the divergence constraint in [15], where post- 
and pre-processing methods are suggested in order to numerically maintain the 
divergence free constraint. 

Finally, in [12] we implemented high-order MHD in the AREPO code using two 
different approaches for maintaining the divergence constraint, a locally divergence- 
free basis combined with Powell terms for stability, and a hyperbolic divergence 
cleaning method. Two new numerical ingredients were introduced in the DG 
scheme: a non-linear limiting procedure for the magnetic field, and a different 
discretization of the Powell terms, which was found to be a key aspect for stability 
and accuracy of the method. The beneficial properties of the DG method found 
for hydrodynamical simulations were also confirmed by Guillet et al. [12] for 
the MHD case. The resulting scheme shows lower advection errors and better 
Galilean invariance than a finite volume scheme, and hence constitutes a very 
promising approach for more realistic applications in an astrophysical context. In 
Fig.5, we show as an example the DG simulation of a two-dimensional MHD 
blast wave, which is a particularly challenging test case in terms of maintaining 
positive solutions. In Fig. 6, we demonstrate the convergence of our DG algorithms 
when applied to a smooth isodensity MHD vortex problem that is advected at 
different angles through a periodic domain. Our numerical solutions reproduce 
to good accuracy the expected convergence orders for orders 2—6, both for a 
locally divergence-free basis and a Legendre basis. Note that for increasing order 
progressively more degrees of freedom per cell are needed, and thus more storage 
and computational work per element are required. This enlargement of the cost 
per element is a constant factor, however, leaving the slopes of the convergence 
order unaffected. The high-order methods hence always win in overall efficiency 
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Density Magnetic pressure Mach number 


Fig. 5 A two-dimensional MHD blast wave test problem. The density, magnetic pressure and 
Mach number contours are shown on a 256? grid using a third-order discontinuous Galerkin 
scheme where the MHD equations are written in symmetrized form using so-called Powell terms, 
see [12] 


and accuracy once the mesh resolution lies above a finite cross-over point (see 
also [12, 28] for an analysis of the accuracy of different DG-order as a function 
of invested CPU-time). 


3.2 The Discontinuous Galerkin Method on a Moving Mesh 


In a parallel endeavour, we have investigated in EXAMAG the general problem of 
extending higher-order methods to unstructured meshes that move along with the 
flow. The original AREPO approach employed a second order finite volume scheme 
for this purpose. As this works quite well and is able to improve the resolution 
of flow features by minimizing numerical dissipation from advection, we pursued 
extensions of this approach in three different directions. 

The DG method can be proven to be entropy stable and convergent on a 
fixed grid for a scalar conservation law. In a first line of investigation we have 
shown that the entropy stability is maintained on a moving mesh. This was 
demonstrated in [16] and [17] for a semidiscrete arbitrary Lagrangian-Eulerian 
discontinuous Galerkin method. In [18] these ideas were extended to a fully discrete 
method. Numerical experiments have confirmed these properties also for a multi- 
dimensional implementation of the hydrodynamical equations. 

The above studies were restricted to arbitrary Lagrangian-Eulerian discontinuous 
Galerkin methods, where the grid needs no remeshing. In practice, one needs to 
remesh the grid once in a while since otherwise the mesh quality degrades to 
such an extent that the computations can break down. In [1], a DG method for 
2-D Euler equations was developed on triangular grids and the mesh vertices are 
moved in an almost Lagrangian manner. To maintain good mesh quality, only a 
local remeshing was performed in regions where the mesh quality has become poor. 
As an example, Fig. 7 shows an isentropic vortex that is also advecting. Due to the 


EXAMAG: Simulations of the Magnetic Universe 341 


LDF basis with Powell terms Legendre basis with hyperbolic cleaning 

10-2 10-2 

10-4 10-4 
a s 
a a 

8 10-6 8 10-6 

10-8 10-8 

107 10 107 10 

10 21 = , l r we T T T T 
8? 16? 32? 64? 128? 8? 16? 32? 64? 128? 
Resolution Resolution 

10 2 10~? 

10-4 1074 
a a 

S 106 S 106 
B B 
3 3 
À À 

10-8 10-8 

107 10 107 10 

10-22 4G = r t 1 ioe T T T T 
82 16? 32? 64? 128? 8? 16? 32? 64? 128? 
Resolution Resolution 


Fig. 6 Convergence of the MHD isodensity vortex problem computed for different DG orders. 
Solution L? errors are plotted for By (top row) and pressure (bottom row), for both a locally 
divergence-free (LDF) basis with Powell terms (left column), and a Legendre basis with hyperbolic 
cleaning (right column). Errors are measured at time t = 20 after the vortex has crossed the whole 
computational domain. Dotted lines show theoretical slopes for convergence orders 2 to 6. Solid 
and dashed lines correspond to errors for an advection angle œ = 45° and a = 30°, respectively. 
The shaded area corresponds to a range of resolutions for which the vortex is resolved but not 
over-resolved (see [12]) 


high shearing inside the vortex, the mesh would become heavily skewed with time, 
but our remeshing scheme is able to maintain good mesh quality over long time 
intervals. 

Finally, we pursued a third direction. Would it be possible to use a moving mesh 
composed of Voronoi cells, as employed by AREPO for a second order finite volume 
method, also for a high order DG method? The difficulty here is that this type of 
mesh is effectively remeshed everywhere at every time step, implying that the cell 
connectivity and their topology can change during a timestep. This represents a 
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Fig.7 A rotating isentropic vortex in two space dimensions: we show the mesh and the pressure 
of the solution to the Euler equations at various times: (a) t = 0, (b) t = 6, (c) t 12, and (d) 
t = 20. Notice how the mesh quality does not deteriorate thanks to the local remeshing technique 


described in [1] 


significant challenge for DG, although some guidance may also be obtained from 
high-order curvilinear finite element methods (e.g. [7]). Note that a higher order 
method is in particular high order in time, meaning it needs predictor steps between 
two consecutive time steps. Thus the connectivity of a cell at one time step with 
the cell at the next time step is needed, but a moving Voronoi cell may change the 
number of its sides during the step. This leads to significant geometric complications 
in formulating a consistent time evolution of the basis function expansion. The 
associated challenges were successfully overcome in [8], for the moment in two- 
dimensions only. But it appears conceptually straightforward to generalize this 
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solution to three dimensions, therefore the path to a high order DG method in 3D 
using the moving Voronoi mesh of AREPO is now open. 


3.3 Turbulence Simulations with Higher Order Numerics 


We have applied our hydrodynamic DG code developed in [28] to the simulation 
of three-dimensional hydrodynamic subsonic turbulence [4]. This allowed us to 
demonstrate that this DG implementation gives accurate results at noticeably less 
computational cost than a finite volume method. 

Driven magnetohydrodynamical turbulence is an even richer physics problem, 
which is of particular importance in a variety of astrophysical contexts, including 
star formation in the interstellar medium, stellar atmospheres, and the X-ray 
emitting gas in clusters of galaxies. In Pakmor et al. (2019, in prep), we have applied 
our DG-MHD code developed in [12] to the problem of isothermal turbulence in a 
uniform box, with the goal to test different numerical schemes for preserving the 
MHD constraint, and for validating the effectiveness of the high-order DG approach. 
Of particular interest is whether there are any systematic differences between a 
constrained transport (CT) finite-volume MHD approach, which is able to guarantee 
the divergence free constraint to machine precision at all times, and the Powell and 
Dedner approaches for divergence control, for which we also have high order DG 
formulations. It is sometimes suspected that CT may be required to obtain truly 
accurate solutions for this problem. Reassuringly, our results, summarized in Fig. 8, 
do not support this view. The statistical properties of the quasi-stationary turbulent 
flow are very consistent between the different schemes. As expected, the growth rate 
of the turbulent dynamo increases with resolution and order of the scheme. While 
the slope of this relation is similar for all schemes, there are interesting differences 
in the absolute growth rate at a given resolution between the different schemes. 
If anything, here CT appears slightly more dissipative than the Powell approach. In 
contrast, the saturated magnetic energy does not seem to depend on the resolution or 
scheme, provided a minimum resolution is used that again depends on the scheme. 
Interestingly, here the Powell approach is able to numerically represent a working 
dynamo already at lower resolution than the CT approach. 


4 Performance and Public Release of the Two Cosmological 
Hdyrodynamical Simulation Codes GADGET-4 and 
AREPO 


As part of EXAMAG, we have also developed a memory efficient and fast N- 
body/hydrodynamical code, GADGET-4, which is primarily intended for extremely 
large simulations of cosmic structure formation (Gpc* volumes), targeting cos- 


344 V. Springel et al. 


MM — DednerStatic eo DG — Powell —18—1 
MM — Powell g DG — Dedner — 2 


ie) 
8 MM — Dedner 
8 


DG — Powell — 18-2 
MM-CT DG — Dedner —3 
DG — Dedner — 1 DG — Powell — 18-3 


107! 


(Eg / Ekin ) saturated 


MM — DednerStatic DG — Powell — 18-1 

MM — Powell DG — Dedner — 2 

MM — Dedner DG — Powell — 18 —2 
DG — Dedner — 3 


1 0) -2 3 DG — Dedner —1 DG — Powell — 18-3 
101 102 
N 


Fig. 8 Growth rate of the magnetic field strength in units of the eddy-turnover time feq = 
L/(2 Mcs) (top panel) and the ratio of the magnetic to the kinetic energy in the saturated state 
(bottom panel) for isothermal turbulence in a uniform box for different numerical schemes and 
spatial resolution. N is the number of mesh cells per spatial dimension. The turbulence is driven 
by purely solenoidal driving and saturates at a Mach number of M ~ 0.3. We compare different 
numerical schemes to control the divergence of the magnetic field, including constrained transport 
(CT), Dedner cleaning, and Powell terms. Also, we compare finite volume on a moving mesh 
(MM) with high-order DG on a Cartesian mesh 
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mological applications. GADGET-4 (Springel et al., 2019, in prep) represents a 
complete rewrite of the successful and widely used GADGET code [30], using C++ 
and numerous refined algorithms. For example, it supports a variety of additional 
gravity solvers, among them a high-order fast multipole method (FMM), as well as 
hierarchical local time-integration techniques. 

The code is highly scalable, and can be run with two different approaches for 
hybrid parallelization, either a mix of MPI and OpenMP parallelization, or a novel 
shared memory parallelization model based on MPI-3 where one MPI rank is set 
aside on each shared memory node to respond to communication requests from MPI 
processes on remote nodes with minimum latency, thereby realizing truly one-sided 
communication independent of MPI progress engines. Within each node, the MPI 
layer can be bypassed entirely through shared-memory accesses in this method. 

Figure 9 shows the effectiveness of this approach on the SuperMUC-NG machine 
at the Leibniz Supercomputing Centre (LRZ) in a weak scaling test, where the 
TreePM force computation algorithm [2, 30] is used. This approach assembles 
the total force as the sum of a short-range gravitational force computed with a 
hierarchical multipole expansion (here done with a one-sided tree algorithm, [3]), 
and a long-range gravity computed in Fourier space. This approach is particularly 
efficient for periodic cosmological simulations at high redshift where the density 
fluctuations are small and the large-scale residual forces nearly vanish. In this 
regime, pure Tree- or FMM-algorithms need to open many more nodes to accurately 
recover the near cancellation of most of the large-scale forces. 
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Fig.9 Scaling of the GADGET-4 code for cosmological simulations of structure formation on 
the SuperMUC-NG supercomputer. The scaling of the dominant tree calculation is excellent. The 
long-range gravity is done through very large FFT grids, which are communication bound. Our 
column-based parallel FFT (solid green) scales beyond the point where the number of MPI ranks 
exceeds the number of mesh planes (vertical dotted line), but requires more transpose operations 
than a slab-based algorithm (dashed green). For the largest number of nodes, two or more islands 
of the machine are needed, reducing the available cross-sectional bandwidth, which impacts the 
scaling behavior of the communication heavy long-range gravity and domain decomposition 


346 V. Springel et al. 


The most expensive part of the calculation, the Tree-based computation of the 
short-range gravity scales perfectly to 49,152 cores (1024 nodes on SuperMUC- 
NG), thanks to the MPI-3 parallelization scheme, which is able to eliminate a 
mid-step synchronization point that was still necessary in our older conventional 
communication scheme for this algorithm. Because the amount of work per MPI 
rank stays constant in the weak scaling regime for the short-range part of the TreePM 
algorithm, and the amount of data that needs to be imported from neighboring 
ranks stays approximately constant, too, a near perfect scaling should in principle 
be reachable. However, because the communication pattern is complex and highly 
irregular, this has normally become a bottleneck for the scalability to very large 
numbers of MPI ranks. This is here successfully eliminated thanks to our one-sided 
(and fully portable) communication approach. 

In contrast, the FFT-based calculation of the long-range gravity is communication- 
bandwidth bound and shows poorer scalability, as expected, but still stays 
subdominant overall. Here for the largest problem sizes an additional scaling 
bottleneck is resolved by GADGET-4. For a standard slab-based decomposition 
of the FFT (green dashed lines), there comes a point when there are more MPI 
ranks than mesh planes (marked by the vertical dotted line), at which point not 
only scalability ends, but also memory imbalance will quickly grow. This impasse 
is overcome in GADGET-4 with a column-based parallel FFT algorithm (green 
solid lines), which maintains scalability and memory balance up to the largest 
foreseeable problem sizes in cosmology. However, this algorithm requires twice as 
many transpose operations, making it more costly for small problem sizes where the 
slab-based approach is still viable. Like the parallel FFT, the domain decomposition 
algorithm is also communication bound and thus deviates from ideal scalability, but 
this part of GADGET-4 is fast enough to always stay subdominant. We note that 
the alternative hybrid parallelization through a combination of MPI and OpenMP 
yields very good thread scalability, see Fig. 10 (right panel), but shows slightly 
poorer overall scalability when the number of shared-memory nodes becomes large 
due to losses in its MPI communication algorithm (which still contains a midstep 
synchronization point). 

We have also developed new on-the-fly group finding and merger-tree building 
techniques for GADGET-4 (which also scale well, see Fig. 10, left panel), as well as 
sophisticated outputting strategies for light-cones and high angular resolution maps 
of line-of-sight projections of various quantities, such as the total mass (for weak 
lensing). These features are designed to support collisionless N-body simulations 
with extreme particle numbers in the regime of 10!? particles and beyond. The 
huge data volume necessitates that post-processing calculations are done during the 
simulation as much as possible to avoid the need for enormous disk storage capacity. 
In fact, GADGET-4 in principle removes the need to produce any time-slices of 
particle data, thereby eliminating a substantial obstacle to carry out semi-analytic 
galaxy formation on merger trees based on simulations in the trillion particle regime. 
Attempting this with the same approach as in the Millennium simulation [32], 
where of order 100 time slices were produced and merger trees were made in 
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Fig. 10 The left panel shows a strong scaling test of the built-in FOF group finder in GADGET-4, 
while the right panels shows the speed-up of the gravity calculation in GADGET-4 as a function 
of the number of OpenMP threads employed when run in MPI-OpenMP hybrid mode. In the latter 
example, a single MPI rank on a single node of Intel Xeon 6138 cluster was used. Even though the 
node has two processors with 20 physical cores each, the OpenMP scaling extends well into the 
second processor, despite the reduced memory bandwidth this entails. In practice, GADGET-4 is 
best run with at least 2 MPI ranks per processor (corresponding to up to 10 OpenMP threads on 
this cluster). In this regime, the OpenMP scaling is excellent 


post-processing, would require 6 PB of particle storage, something that we can 
completely avoid with the new code. 

As a legacy of EXAMAG, both the GADGET-4 and AREPO codes are scheduled 
for public release in 2019. Weinberger et al. [36] introduces the community version 
of the AREPO code, and provides an overview of the available functionality as 
well an introduction of AREPO to new users. This is augmented with a suite of 
examples of different complexity, a test suite, and a support forum that is hosted 
on the code’s website as a platform to ask questions about the code. A similar 
release is planned for the GADGET-4 code (Springel et al., 2019, in prep). We hope 
that this will foster an active and supportive user community that will contribute 
to the further development of these highly parallel simulation codes, preparing for 
their eventual use on exascale class computers, as envisioned by SPPEXA and its 
EXAMAG subproject. 


5 Summary and Discussion 


The primary research goals of the EXAMAG project have been to develop new 
mathematical methods and physics implementations in state-of-the-art hydrody- 
namical codes, allowing them to be used for groundbreaking astrophysical research 
that can make full use of the capabilities of current and emerging HPC platforms. 
Our approach consisted of leveraging a tight collaboration between applied math- 
ematicians and numerical astrophysicists, thereby allowing a quick transfer of new 
mathematical ideas into applications at the forefront of today’s supercomputer 
applications in astrophysics and cosmology. 
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In hindsight, we feel that our strategy to immediately apply new numerical pro- 
cedures in large application projects has been successful. This provided immediate 
feedback on the most promising development directions, and thus allowed us to 
iteratively improve codes used for production science on powerful supercomputers. 
Our special focus on treatments of ideal magnetohydrodynamics has allowed us 
to make significant progress on the physical fidelity of simulations of galaxy 
formation, thereby making the IIlustrisTNG and Auriga projects possible in the 
first place. The MHD capability also provided the foundations for new solvers 
we developed for anisotropic diffusive transport processes, relevant especially for 
cosmic rays [23], thermal conduction [13] and radiative transport [14]. 

There is no shortage of ideas for developing the performance and capabilities 
of our AREPO, GADGET-4 and DG codes further in the future. Extending our 
DG-MHD techniques to high-order methods for self-gravity and source terms such 
as radiative cooling are an obvious direction. Other challenges lie in the techni- 
cal aspects of parallelization, where especially the MPI-3 based shared-memory 
approach that we have introduced in GADGET-4 looks particularly promising for 
adoption in AREPO as well. It is clear that dedicated and sustained research efforts 
in numerical method development remain the basis for future scientific progress in 
computational astrophysics. 
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elements (SSRVEs) is presented. Furthermore, the highly scalable nonlinear domain 
decomposition methods NL-FETI-DP and nonlinear BDDC are introduced and 
weak scaling results are shown. These methods can be used, e.g., for the solution of 
the microscopic problems. Additionally, some remarks on sparse direct solvers are 
given, especially to PARDISO. Finally, we come up with a computationally derived 
Forming Limit Curve (FLC). 


1 Introduction 


In the EXASTEEL project, we are solving challenging nonlinear multiscale prob- 
lems from computational material science showing parallel scalability beyond a 
million parallel processes. Our software package FE2TI solves large scale contact 
problems in sheet metal forming of microheterogeneous materials and scales 
to some of the largest supercomputers available today. Although an exascale 
computer is not yet available, FE2TI is exascale ready: For our current production 
simulations, we have not pushed the combined parallelism of the FE* multiscale 
computational homogenization method and of our nonlinear solvers to the limit. 
Both, i.e., the FE? method by itself, as well as our nonlinear solvers are scalable 
to the largest supercomputers currently in production in the leading international 
computing facilities. ! 

In particular, as a problem, we consider the finite element simulation of sheet 
metal forming processes of dual-phase (DP) steels, whose macroscopic material 
behavior strongly depends on its microscopic material properties. A brute force 
discretization with respect to the microscopic structure would lead to extremely 
large systems of equations, which are not feasible, even on the upcoming exascale 
supercomputers. To give an example, a reasonable finite element discretization 
down to the microscopic scale would require 10°—10° finite elements for a three 
dimensional cube with a volume of 1 um. Extrapolating this to a sheet with an 
area of 1 m? and a thickness of 1 mm would lead to 10'8—10** finite elements. A 
brute force simulation would also require knowledge of the complete microstructure 
of the steel sheet which is not available. Therefore, an efficient multiscale or 
homogenization approach is indispensable to save 3 to 6 orders of magnitude of 


'In 2011, the overall scientific goal of the German DFG priority program SPP 1648 “Software 
for Exascale Computing” (SPPEXA) was stated as “to master the various challenges related 
to [...] [the] paradigm shift from sequential or just moderately parallel to massively parallel 
processing” and thereby to “advance the frontier of parallel computing” [4]. From the beginning, 
SPPEXA aimed at a true co-design, i.e., closely connecting “computer science with the needs of 
Computational Science and Engineering (CSE) and HPC” [4]. The project EXASTEEL addresses 
three of the main SPPEXA research areas, namely computational algorithms, application software, 
and programming, i.e., we have, e.g., introduced new nonlinear solver algorithms, implemented 
our multiscale application software FE2TI, and applied hybrid programming and performance 
engineering to our codes. This work was only possible in close collaboration of mathematics, 
computer science, and engineering. 
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unknowns. Our choice of a computational homogenization approach is the FE? 
method which is well established in engineering; see Sect. 3 for a short introduction 
and further references. In the FE? method, the microscopic and macroscopic 
level are discretized independently of each other. No material laws are needed 
for the macroscopic level, all information needed is obtained from microscopic 
computations based on material laws and data on the microscopic level. Let us 
note that the microscopic problems can be solved in parallel once the solution of 
the macroscopic problem is available as input. The solution of the macroscopic 
problem, however, requires the information of all microscopic solutions. Thus, 
the FE? method is not trivially parallelizable but requires a sequential solution 
of the microscopic and the macroscopic problems; this is similar to the coarse 
level of a hybrid two-level domain decomposition method with multiplicative 
coarse level and additive subdomain solvers. The nonlinear problems on both 
levels, the macroscopic and the microscopic one, can be solved (after linearization) 
using highly parallel scalable and robust implicit solvers such as parallel algebraic 
multigrid methods (AMG) or parallel domain decomposition preconditioners such 
as FETI-DP (Finite Element Tearing and Interconnecting-Dual-Primal) [27, 28, 47— 
50] or BDDC (Balancing Domain Decomposition by Constraints) [20, 24, 71-73] 
methods. These preconditioners are usually applied as part of a Newton-Krylov 
approach, where the tangent problem in each Newton iteration is solved using 
preconditioned Krylov iteration methods. A more recent approach to nonlinear 
implicit problems, developed extensively within EXASTEEL, is given by non- 
linear parallel domain decomposition methods, which are applied directly to 
the nonlinear problem, i.e., before linearization. In such methods, the nonlinear 
problem is first decomposed into concurrent nonlinear problems, which are 
then solved by (decoupled) Newton’s methods in parallel. In this project, 
nonlinear FETI-DP and nonlinear BDDC domain decomposition methods (see 
also Sect. 6) have been introduced and have successfully scaled to the largest 
supercomputers available—independently of the multiscale context given by the 
FE? homogenization methods, which adds an additional level of parallelism. It was 
found that the nonlinear domain decomposition methods can reduce communication 
and synchronization and thus time to solution. They can, however, also reduce 
the energy to solution; see Sect. 6.1.1 and [63]. These methods can be applied 
within our highly scalable software package FE2TI but can also be used for all 
problems where implicit nonlinear solvers are needed on extreme scale computers. 
For scaling results of the FE* method to more than one million MPI ranks, see 
Fig. 3 in Sect. 3.2 and [64]. Note that these scaling results can be obtained only using 
additional parallelization on the macroscopic level. Note also that our new nonlinear 
implicit solvers based on nonlinear FETI-DP have scaled to the complete Mira 
supercomputer, i.e., 7,58,000 MPI ranks (see Fig. 15 and [57]); on the JUQUEEN 
supercomputer [44] (see [60]) our solver based on nonlinear BDDC has scaled to 
2,62,000 MPI ranks for a 3D structural mechanics problem as well as 5,24,000 
MPI ranks for a 2D problem. In the present article, the software package is used 
to derive a virtual forming limit diagram (FLD) by simulating the Nakajima test, 
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a standard procedure for the derivation of FLDs. An FLD contains a Cartesian 
coordinate system with major and minor strain values and a regression function 
of these values, which is called forming limit curve (FLC). An FLC gives the extent 
to which the material can be deformed by stretching, drawing or any combination 
of stretching and drawing without failing [77, p. v]. 

The software and algorithms developed here have participated in scaling work- 
shops at the Jiilich Supercomputing Centre, Forschungszentrum Jiilich, Germany 
(see the reports [53, 58]), as well as at the Argonne Leadership Computing Facility 
(ALCF), Argonne National Laboratory, USA. They have scaled on the following 
world-leading supercomputers in Europe, the United States, and Asia (TOP500 rank 
given for the time of use): 


JUQUEEN at the Jiilich Supercomputing Centre, Germany; European Tier 0; 
TOPS500 rank 9 in the year 2015 (458,752 cores; 5.8 petaflops); FE2TI and FETI- 
DP have scaled to the complete machine [53, 56-58, 64]; since 2015 FE2TI is a 
member of the High-Q Club of the highest scaling codes on JUQUEEN [53]. 
JUWELS at Jiilich Supercomputing Centre, Germany; European Tier 0; TOP500 
rank 23 in the year 2018 (114,480 cores; 9.8 petaflops); main source of compute 
time for the computation of an FLD; see Sect. 5 

Mira at Argonne Leadership Computing Facility (ALCF), Argonne National 
Laboratory (ANL), USA; TOP500 rank 5 in the year 2015 (786,432 cores; 
10.1 petaflops); FE2TI and nonlinear FETI-DP have scaled to the complete 
machine [54, 57] 

Theta at ALCF, USA; TOP500 rank 18 in the year 2017 (280,320 cores; 9.6 
petaflops); is ANL’s bridge to the upcoming first US exascale machine AURORA 
(or AURORA21) scheduled for 2021; BDDC domain decomposition solver has 
scaled to 193,600 cores [60] and recently to 262,144 cores 

Oakforest-PACS at Joint Center for Advanced High Performance Computing, 
Japan; TOP500 rank 6 in the year 2016 (556,104 cores; 24.9 petaflops); first deep 
drawing computations using FE2TI 


The remainder of the article is organized as follows. In Sect. 2, we introduce 
the experimental test setup of the Nakajima test and the evaluation strategy of 
major and minor strain values described in DIN EN ISO 12004-2:2008 [77]. In 
Sect. 3, we briefly describe the ingredients of our highly scalable software package 
FE2TI, including the computational homogenization method FE? and the contact 
formulation which is integrated into the software package FE2TI since the sheet 
metal deformation in the Nakajima test is caused by contact. We also present 
some strategies to reduce computing time. In Sect. 4, we describe the numerical 
realization of the Nakajima test. Then, in Sect. 5, we present numerical results of 
several in silico Nakajima simulations with different specimens resulting in a virtual 
FLC. In Sect. 6, we give an overview over our highly scalable linear and nonlinear 
implicit solvers, including nonlinear FETI-DP and nonlinear BDDC. These methods 
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can be used to solve all nonlinear problems occurring in FE2TI, as shown, e.g., 
in [64]. In Sect. 7, performance engineering aspects regarding the sparse direct 
solver package PARDISO [81] are discussed. The PARDISO sparse direct solver is a 
central building block in our implicit solvers and in FE2TI. In Sect. 8, we introduce 
different improvements on the microscopic material model to even better match 
some experimental results. 


2 Nakajima Test 


Stricter CO2 emission regulations in combination with higher passenger safety 
norms in the automotive industry requires steel grades with higher toughness and 
less weight. The class of DP steels belongs to the advanced high-strength steels and 
combines strength and ductility. Its favorable macroscopic properties result from 
the microscopic heterogeneous structure; see the beginning of Sect. 8 for further 
remarks. 

To demonstrate the macroscopic material behavior of a specific steel grade, 
different material parameters and forming behaviors have to be proven. A prominent 
member of material characterization is the forming limit diagram (FLD). It contains 
major and minor strain values at failure initiation in a Cartesian coordinate system 
and represents the forming limits of a steel grade for one specific material thickness. 
In this context, material failure is already associated with the beginning of local 
necking in the direction of thickness and not only with crack formation [77, p. v]. 
The major and minor strain values vary from uniaxial to equi-biaxial tension. 

The Nakajima test is a standard procedure in material testing. In the Nakajima 
test, a specimen is clamped between a blank holder and a die and a hemispherical 
punch is driven into the specimen until a crack can be observed; see Fig. | (left). 
Friction between the forming tool and the specimen has to be avoided as much 
as possible. Therefore, different lubrication systems can be applied; see [77, Ch. 
4.3.3.3]. To get different pairs of major and minor strains, one has to use at least five 
different shapes of sample geometries and for each shape, one has to carry out three 
different valid tests [77]. The recommended shapes of the sample sheet geometries 
are described in [77, Ch. 4.1.2], see also Sect. 4.1 and Fig. 1 (right) for an example 
of a permissible sample sheet. In experiments, the surface of a specimen is equipped 
with a regular grid or a stochastic pattern and is recorded by one or more cameras 
during the deformation process. 

There are at least two different strategies to get the pair of major and minor strains 
for the FLC, namely the cross section method [77] and a method based on thinning 
rates proposed by W. Volk and P. Hora [97]. Since the FLC gives information about 
material deformation without failing, we are interested in major and minor strains 
just before localized necking occurs. 
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\ fillet/cutoff 7 
radius 30mm 


Fig. 1 Left: Cross section of the initial test setup of the Nakajima test. Right: Dimensions of a 
specimen used for the simulation of the Nakajima test with a shaft length of 25 mm and a parallel 
shaft width of 90 mm. The inner (red) circle represents the inner wall of the die and the outer 
(green) circle represents the beginning of the clamped part between die and blank holder. Material 
outside the outer (green) circle is only considered for a width of the parallel shaft of 90 mm or 
more (dark grey) 


In the method based on thinning rates, the last recorded image before localized 
necking occurs is explicitly determined. This specific image is used to derive major 
and minor strains for the FLC. 

The cross section method is standardized in DIN EN ISO 12004-2:2008 [77]. It 
uses knowledge about the position of the crack and evaluates major and minor strain 
values in the last recorded image before crack along cross sections perpendicular to 
the crack. Then, from these values, the state immediately before material failure is 
interpolated. Cross sections have a length of at least 20 mm at both sides of the crack. 
One cross section cuts exactly through the center of the crack and one or two cross 
sections are positioned above and below with a distance of about 2 mm. For each 
cross section, we want to compute a pair of major and minor strains ies and eo, 
which represent the major and minor strains just before the beginning of plastic 
instability.” Therefore, we have to fit an inverse second-order polynomial using a 
least squares fit; see Figs. 2 and 8 (bottom). Instead of fitting inverse second-order 
polynomials to the values along the cross sections we fit second order polynomials 
to the inverse of the values. For the least squares fit we have to determine optimal 
fit windows for both sides of the crack separately; see Figs. 2 and 8 (bottom). For 
a detailed description of the procedure we refer to [77]. Let us note that ee and 
ae in general never exist during the deformation process. Hence, these numbers 
do not have a physical meaning [97]. 


Note that here and in the following, all macroscopic variables and objects are denoted with an 
overline to distinguish them from microscopic variables and objects. 
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Fig. 2 Fitted inverse second-order polynomials to the major strain values along the first cross 
section just before material failure. See also the description of the cross section method in Sect. 2. 
Optimal fit windows are computed as described in [77, Ch. 5.2.3, 5.2.4]. Left: Specimen with a 
width of the parallel shaft of 70 mm. Right: Full circular specimen 


3 FE2TI: A Highly Scalable Implementation of the FE” 
Algorithm 


For the finite element simulation of the Nakajima test, we use our FE2TI software 
package [9, 52, 57, 64], which is a C/C++ implementation of the FE? computational 
homogenization approach [29-31, 33, 70, 75, 86, 87, 91]. It is based on PETSc 
[6] and MPI. The multiscale simulations based on FE2TI and using FETI-DP and 
BoomerAMG as solvers are a “CSE Success Story” in SIAM Review [80, p. 736]. 


3.1 FE? Algorithm 


For DP steel, the overall macroscopic material behavior strongly depends on 
its microscopic properties. Assuming that the macroscopic length scale L is 
much larger than the length scale / representing the microscopic heterogeneities, 
i.e., L >> 1, the scale separation assumption is satisfied and a scale bridging or 
homogenization approach such as the FE? method can be applied. 

The idea of the FE? approach is to discretize the micro- and macroscopic scale 
separately using finite elements. The macroscopic domain is discretized without any 
consideration of the microscopic material properties, i.e., the material is assumed to 
be homogeneous from a macroscopic point of view. Additionally, a microscopic 
boundary value problem is defined on a representative volume element (RVE) 
which is assumed to represent the microscopic heterogeneities sufficiently. One 
microscopic finite element problem is assigned to each macroscopic Gauß point 
and the phenomenological law on the macroscopic level is replaced by volumetric 
averages of stresses and associated consistent tangent moduli of the microscopic 
solution. Note that the boundary values of the microscopic level are induced through 
the macroscopic deformation gradient at the integration point the microscopic 
problem is attached to. 
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To derive an RVE representing a realistic microstructure, electron backscatter 
diffraction is used; see [14]. Note that for DP steel the martensitic inclusions in 
the ferrite are quite small and widely spread, which enforces a fine discretization to 
incorporate the heterogeneities sufficiently. To overcome this problem, we make use 
of so called statistically similar RVEs (SSRVEs) [8, 83], which are constructed in an 
optimization process with only inclusions of simple geometry such as ellipsoids, but 
describe the mechanical behavior in an approximate way. Note that the constructed 
ellipsoids are simpler than the realistic microstructure and hence, the SSRVE can be 
discretized with a coarser grid. 

For further details such as the derivation of consistent tangent moduli we refer to 
the literature [33, 87] and to earlier works on computational homogenization in the 
EXASTEEL project [9, 52, 57, 64]. 


3.2 FE2TI Software Package 


The FE2TI software package was developed within the EXASTEEL project and 
has been successfully used for the simulation of tension tests of DP steel [9, 52, 57, 
64]. It belongs to the highest scaling codes on the European Tier-0-supercomputer 
JUQUEEN since 2015. 

For comparably small sizes of microscopic problems, we can solve the resulting 
tangent problems with a sparse direct solver such as PARDISO [81], UMFPACK 
[22], or MUMPS [2]. For larger sizes of microscopic problems, we have to use 
efficient parallel solvers which are also robust for heterogeneous problems. Such 
methods are Newton-Krylov methods with appropriate preconditioners such as 
domain decomposition or (algebraic) multigrid or nonlinear domain decomposition 
methods, possibly, combined with algebraic multigrid. 

In our software package, Newton-Krylov-FETI-DP and the more recent highly 
scalable nonlinear FETI-DP methods, which were developed in this project (see 
(51, 54, 59] and Sect. 6.1), are integrated. Other nonlinear domain decomposition 
approaches are the related Nonlinear FETI-1 or Neumann-Neumann approaches 
[13, 78] or ASPIN [17, 18, 35-37, 40, 41, 66]. Furthermore, FE2TI can also use 
the highly scalable algebraic multigrid implementation BoomerAMG [5, 38] from 
the hypre package [25] for micro- as well as macroscopic problems. The scalability 
of BoomerAMG was recently improved for problems in elasticity, and scalability 
of BoomerAMG to half a million ranks was then achieved within the EXASTEEL 
project [5] in close collaboration with the authors of BoomerAMG. 

For the contact simulations presented here, we consider problem sizes for which 
we can use the direct solver package PARDISO to solve the resulting tangent 
problems on the microscopic as well as on the macroscopic level. This limits the 


3https://www.fz-juelich.de/ias/jsc/EN/Expertise/High-Q-Club/FE2TI/_node.html and also https:// 
juser.fz-juelich.de/record/188191. 
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size of our problems but is suitable for mid-sized supercomputers. In our opinion, 
this is a relevant criterion for the applicability in industry. Using our parallel 
nonlinear solvers, the FE2TI package scales up to the largest machines even without 
making use of the full scaling potential of the solvers (see Fig.3); and for the 
combination of large macroscopic problems with large RVEs an exascale computer 
will be necessary in the future. While Fig.3 (left) represents a weak scaling 
study with large RVEs and comparably small macroscopic problems, in Fig.3 
(right) the macroscopic problems are larger. Therefore, in the latter case, a parallel 
macroscopic solve using GMRES with an AMG preconditioner is beneficial. The 
scalability in Fig.3 (right) somewhat suffers from an increase in the numbers of 
Newton iterations. Let us remark that the setup in Fig.3 (right) is the setup of 
a typical production run. The strong scaling potential of FE2TI is also presented 
in Fig. 4; see [9] for details. For more scalability results on different architectures 
also see [64]. 

Even if the macroscopic problem is solved with a direct solver, the assembly pro- 
cess is parallelized. For the incorporation of a material law on the microscopic level 
the software is equipped with an interface to FEAP, and we use an implementation 
of a J2 elasto-plasticity model [65]. Material parameters are chosen as in Brands et 
al. [14, Fig. 10]. 


I FE2TI Weak Scalability @ FE2TI 
184.9 185.1 185.6 185.7 186.4 1866 187.3 187.7 E FE2TI with parallel GMRES+AMG macro solve 


TNE onh 


8K 16K 32K 65K 131K 262K 393K 458K 64K 256K 512K 1024K 


S Ul UONNIOS 0} EL, oL 


200 
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#MPI ranks = #cores #MPI ranks = 4 * cores 


Fig. 3 Weak scalability of the FE2TI software on the JUQUEEN supercomputer [44]. Left: Time 
to solution of a single load step solving a three-dimensional heterogeneous hyperelastic model 
problem; uses Q1 finite elements (macro) and P2 finite elements (micro); 1.6M d.o.f. on each RVE; 
512 FETI-DP subdomains for each RVE; the macroscopic problem size grows proportionally to 
the number of MPI ranks while the microscopic problem size is fixed; corresponding data in [57, 
Tab. 2]; High-Q club computation in 2015. Right: Total time to solution for 13 load steps solving 
3D heterogeneous plasticity; uses Q1 finite elements (macro) and P2 finite elements (micro); 200K 
d.o.f. on each RVE; 64 FETI-DP subdomains for each RVE; the macroscopic problem is increased 
proportionally to the number of MPI ranks; for the larger problems using parallel AMG for the 
problem on the macroscale, instead of a sparse direct solver, is beneficial; see also [64, Fig. 15] 
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Fig. 4 Strong scalability of the FE2TI software for a nonlinear elasticity problem. Macroscopic 
problem with 32 finite elements; each RVE with 107K degrees of freedom is solved using 512 
FETI-DP subdomains. Simulation of one macroscopic load step. Left: Total time to solution; 
Right: Speedup. Figures from [9] 


3.3 Contact Kinematics and Incorporation of Contact 
Constraints for Frictionless Contact in FE2TI 


For the simulation of the Nakajima test, we have to consider contact between 
the deformable specimen 8 and different rigid tools 7;, i = 1,2,3, namely 
the hemispherical punch, blank holder, and die; see Fig. 1 (left). Therefore, we 
implemented a contact algorithm on the macroscopic scale in FE2TI. To simplify 
the notation, we consider an arbitrary rigid tool 7 in the following. 

A general convention in contact formulations is to consider one contact partner 
as the master body and one contact partner as the slave body [68, 99]. 

Only points of the contact surface of the slave body are allowed to penetrate 
into the master body. Following [68], one can choose the rigid surface as slave 
surface as well as a master surface, and in [99, Rem. 4.2] it is recommended to 
use the rigid surface as master surface; we have applied the latter in our simulations. 
Nevertheless, the contact contributions to the stiffness matrix and the right-hand- 
side are computed in the coordinate system of the deformable body. 

In every iteration, we have to check for all finite element nodes xg € Fg of the 
contact surface of 8 whether it penetrates into F or not; see Fig. 5 for a simplified 
illustration. For each xg € Pe we have to determine the related minimum distance 
point xo" = ming eT. xg — X|| of the contact surface of T. Now, we can 
formulate a non-penetration condition 


—min 
n 


Zne = Og XE"): m" > 0, Yg € Ig. (1) 


Alternatively, for all points xg € Te:= Ez € Tg | ENP < o} which penetrate into 
the master body, the amount of penetration can be computed by 


ie) np”, Xg E Te, (2) 
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Fig. 5 Illustration for the determination of active contact nodes and the amount of penetration 


and is set to zero for all other points. Here, 7!" 


tool at ¥@!"; see Fig. 5. 

Since the contact partners of the sheet metal are assumed to be rigid, the tools 
are not discretized by finite elements but the contact surfaces are characterized by 
analytical functions. This also simplifies the computation of the related minimum 
distance point and, hence, the computation of the outward normal direction as well 
as of the amount of penetration. For a detailed description of contact between two 
deformable bodies we refer to [99, Ch. 4.1]. 

As in standard finite element simulations of continuum mechanics problems, we 
are interested in the minimization of an energy functional fl, but under additional 
consideration of an inequality constraint due to the non-penetration condition 
(Eq. (1)). Constrained optimization can be performed, e.g., using the (quadratic) 
penalty method, where an additional penalty term is added to the objective function 
if the constraint is violated; see [76, Ch. 17]. Let us note that the incorporation of 
contact constraints into a finite element formulation does not change the equations 
describing the behavior of the bodies coming into contact [99]. 

There are several different strategies to incorporate contact constraints into 
finite element formulations, where the penalty method and the Lagrange multiplier 
method are the most prominent members; see [99, Ch. 6.3]. The penalty method 
is the most widely used strategy to incorporate contact constraints into finite 
element simulation software [99, Ch. 10.3.3] because the number of unknowns does 
not increase. In comparison to the Lagrange multiplier method [99], the contact 
constraints are only resolved approximately and a small penetration depending 
on the choice of the penalty parameter €y > 0 is allowed. For €y — ov, the 
exact solution of the Lagrange multiplier method is reached, but for higher penalty 
parameters £y, the resulting system of equations becomes ill-conditioned [99]. For 
a suggestion of a choice of the penalty parameter €y, see [99, Remark 10.2]. 


is the outward normal of the rigid 
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Using the penalty method, we have to add an additional term 
as lZ _, 
Tp= |. 5 -EN : gy dA 


to the energy functional TI for all active contact nodes xg € Te [99, Ch. 6.3]. 
The penalty parameter €y can be interpreted as the stiffness of a spring in the 
contact interface; see [99, Ch. 2.1.3]. Let us note that the definition of an active 
set is different from standard textbooks as [76, Def. 12.1], where points belong 
to the active set if they fulfill equality of the inequality constraint. Other authors 
like Konyukhov and Schweizerhof considered the Heaviside function to follow the 
common definition of an active set; see, e.g., [67, 68]. Since the energy functional is 
changed due to the contact constraints, also the resulting stiffness matrix and right- 
hand-side are affected. 


3.4 Algorithmic Improvements in FE2TI 


In simulations making use of load stepping (or pseudo time stepping) as a global- 
ization strategy, as is the case in FE2TI (see Sect. 4), the time to solution strongly 
depends on the number of load steps as well as on the number of macroscopic 
Newton iterations per load step. The required time of a single macroscopic Newton 
step again depends on the time to solution of the microscopic problems. 

While a large load step may seem desirable, it can lead to slow convergence 
or even divergence; convergence can be forced by reducing the load step size thus 
increasing the total number of load steps; this can be observed in Table 1. To adapt 
the size of the load steps, we use a simple dynamic load step strategy; see Sect. 3.4.1. 

Keeping the number of macroscopic Newton iterations as small as possible is 
directly related to a good choice of the initial value of a single load step. For a 
better prediction of the initial value, we use linear extrapolation; see Sect. 3.4.2. 
This is especially relevant for our contact problems where quadratic convergence of 
Newton’s method can be lost. 


Table 1 First 2 mm displacement of the forming tool with constant load increments 7 of 3.125e—3 
(first row) and 1.0e—1 (second row) for the specimen with a parallel shaft width of 50 mm; two 
MPI ranks per core; computed on JUWELS [45] 


Cov. dist. | Load | Newton | @ Newton its. @ Time per| @ Time per 
punch steps | its. per load step | Runtime | load step | Newton it. 
Load increment | 2 mm 640 | 966 1.51 7595.30 s| 11.87 s 7.86 s 
3.125e—3 
Load increment | 2 mm 20 | 130 6.50 1407.03 s| 70.45 s 10.82 s 


1.0e—1 
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3.4.1 Dynamic Loadstepping 


Our load step strategy depends on macroscopic as well as microscopic information. 
The macro load increment T is reduced when microscopic stagnation is observed 
or when a maximum number of macroscopic Newton iterations per load step is 
reached. Stagnation on the RVE level is detected whenever the norm of the current 
microscopic Newton iteration compared to the previous one does not reduce after 
more than six microscopic Newton iterations or if the number of microscopic 
Newton iterations is larger than 20. The load step is increased based on the number 
of macroscopic Newton iterations per load step. Note that / is not allowed to exceed 
a predefined maximum load increment T"". For details, see Fig. 6. 

Whenever stagnation in a microscopic problem occurs, the microscopic solver 
gives this information to the macroscopic level and the load step is repeated with 
a reduced load increment. Otherwise, stagnation of a microscopic problem would 
lead to a simulation break down due to a missing tangent modulus and stresses in 
the macro Gauf point where the micro problem is attached. 


3.4.2 (Linear) Extrapolation 


For Newton-type methods, it is important to choose a good initial value. If the initial 
value is close to the solution, only a few Newton iterations are necessary. As in [64], 


Macroscopic Load Step k 


Convergence within No Convergence within 
20 Newton iterations 20 Newton iterations 
Not more than 50% of More than 50% of m” || < 1.4tol ie] > 1.4tol 
Newton iterations of Newton iterations of 
previous load step k — 1 previous load step k — 1 + + 
vA 4 = > more I; = 0.5- i; 
= lewton iterations Repeat load step k 


har = 251 La =1 
= k +l - k -max Ik El Ix vA y 
If lea, Si set Iza, = 1 Continue with 


Continue with load step k + 1 load step k + 1 Convergence] | No Convergence 


L4 X 


Ik = 0.5-1 Ik = 0.5 -Ip 


Continue with load step k + 1 Repeat load step k 


Fig. 6 Schematic procedure of reduction and increase of the load increment 7 depending on 


n zmax . * — . $ 
macroscopic events. Here,/ is a predefined maximum load, ze] is the norm of the solution 


of the 20th Newton iteration of the current load step k, and tol is the Newton tolerance 
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we use linear extrapolation based on the converged solutions mwg and ug—1 of the 
take —(0) 
former two load steps to guess a good initial value w}; 4> 


let — Ik- 

—(0 = k+1 k—1 = = 

TO, = Uk-1 + =. (Uk = Uk—1)- 
De — k-i 


In [64], this technique was already successfully used in the FE? simulations using 
the FE2TI software package without contact constraints. For results using second 
order extrapolation, we refer to [95, Ch. 4.2.2]. 


3.4.3 Checkpoint/Restart 


To perform the simulation of the Nakajima test until material failure of a sam- 
ple sheet, i.e., until a failure zone occurs in the metal sheet, often significant 
computation times are needed, even if the full supercomputer is available. To 
reduce the consequences of hardware failures and also to overcome specific wall 
time limits on HPC systems, we equipped our FE2TI package with synchronous 
Checkpoint/Restart (CR). We integrated the CRAFT library (Checkpoint/Restart 
and Automatic Fault Tolerance) [88], which was developed in the second phase 
of the SPPEXA project ESSEX. Let us note that we use synchronous application 
level checkpointing with different checkpoints for the macroscopic level and the 
microscopic level. 

In [21], different checkpoint intervals are introduced based on the expected run 
time of the simulation and the mean time of hardware failure of the HPC system 
the software is performed on, but in all our simulations presented here, we choose 
a fixed checkpoint interval of 75 load steps. Here, we do not take into account that 
the load increment may change during the simulation and that small load increments 
are usually faster. As an improvement, the checkpointing could take into account the 
displacement of the forming tool or a fixed wall time interval could be used which 
also could depend on the mean time of hardware failure. 


4 Numerical Simulation of the Nakajima Test Using FE2TI 


For the simulation of the Nakajima test, we use our highly scalable parallel software 
package FE2TI; see Sect. 3. For the solution of the boundary value problems on 
both scales, we here use the sparse direct solver package PARDISO [81]. Since 
we are considering a DP600 grade of steel, we use a fitted J2 elasto-plasticity 
model on the microscopic level; see [14, Fig. 10]. Throughout this article, we use 
structured Q2 finite element discretizations for the sample sheet and an unstructured 
P2 finite element discretization for the RVEs. Both, the macroscopic as well as the 
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microscopic meshes, are generated using the open source software package GMSH 
[34]. We use the load stepping and extrapolation described in Sects. 3.4.1 and 3.4.2. 

In the Nakajima test, the macroscopic deformation is driven by the rigid punch. 
Hence, load stepping is applied to the movement of the forming tool, where the 
hemispherical punch moves a small step in upward direction in each load step. 

Since in reality a tribological system is set up to avoid friction between the 
hemispherical punch and the sheet metal [77], we consider frictionless contact. 
Hence, we have to deal exclusively with contact conditions in normal direction of 
the rigid tools. Considering frictionless contact, we neglect friction between the 
specimen and the blank holder or die. 


4.1 Description of Specimen Geometry 


In our simulations, we consider specimens with a central parallel shaft and also a 
completely circular specimen. The length of the parallel shaft is 25 mm and the fillet 
radius is 30 mm, which both fit to the normed range in [77]; also see Fig. 1 (right). 

For all specimens, the material is assumed to be completely clamped by the 
bead, which has a radius of 86.5 mm. We therefore only consider material points 
p= (Px Py, P) which have a distance of at most 86.5 mm to the center of the 
sample sheet; see Fig. 1 (right) for an example of a sample sheet with a parallel 
shaft width of 90 mm. In our simulations, the sample sheet is always 1 mm thick, 
and we consider specimens with a parallel shaft width of 30, 50, 70, 90, 100, 110, 
125, and 129 mm as well as the completely circular sample sheet. Note that the 
center T? = (Cx, Cy, T? ) of the bottom surface of all sample sheets is placed in the 
origin of the coordinate system. 

The specifications of the rigid tools are also within the range given in [77]. The 
radius of the hemispherical punch is 50 mm. The blank holder is a square plate of 
173 mm x 173 mm with a circular hole in the middle with a radius of 55 mm; see 
the inner circle in Fig. 1 (right). The die is placed within a distance of 5 mm to the 
rigid punch, i.e., the inner wall of the die also has a radius of 55 mm; see, again, 
the inner circle in Fig. 1 (right). The die radius (see Fig. 1 (left)) is 10 mm, i.e., all 


material points p with , |p + 7 > 65 are potentially clamped; see the outer circle 
in Fig. 1 (right). 

For all sample sheets with a parallel shaft width less than 90 mm as well as 
for the completely circular specimen, we only consider material points p with 
i |p + P < 65 and choose all points with , |p + P- = 65 as Dirichlet boundary 
nodes. For specimens with wider parallel shaft widths, we choose boundary 
conditions analogously to [43]; see also [95]. 
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ova 


Fig. 7 Left: Symmetric computational domain from the full sample sheet and additional Dirichlet 
boundary conditions along the symmetric boundary surfaces. Right: Orientation of the RVEs for 
the different quadrants of the full geometry of the sample sheet after mirroring the first quadrant 
due to the symmetric assumption 


4.2 Exploiting Symmetry 


The setup of the Nakajima test is symmetric. Assuming that the failure zone evolves 
symmetrically, i.e., along the vertical center line, it is sufficient to only simulate a 
quarter of the full geometry and to rebuild the full solution by mirroring; see Fig. 7 
(left). This is only exact, if the RVEs are also symmetric, since mirroring of the 
macroscopic solution also implies mirroring of the RVEs; see Fig. 7 (right). Hence, 
for an asymmetric RVE, we violate the assumption of a periodic unit cell because 
mirroring leads to a change in orientation for all four quadrants. In this case, the 
solution generated by the symmetric assumption is only an approximation to the 
solution of the simulation using the full geometry of the sample sheet, even for the 
first quadrant of the full geometry. Nevertheless, we use the symmetric assumption 
throughout this article, even when the RVEs are asymmetric, to reduce the number 
of MPI ranks by 75%; see Sect. 3.4. As a sanity check, we have also performed 
simulations for the full geometry; these will, however, be presented elsewhere due 
to space limitations. 

For the simulation of one quarter, we have to add further boundary conditions 
along the symmetric boundaries of the quarter. Then, we have to fix all y-coordinates 
of macroscopic material points p with P, = Cy. Analogously, we have to fix all x- 
coordinates for macroscopic points with P, = Cx; see Fig. 7 (left) and Sect. 4.1. 


4.3 Failure Criterion 


For the detection of macroscopic crack initialization, we have to formulate a failure 
criterion and a maximum critical value, which will indicate the initialization of 
failure. Note that, for the computation of the forming limit, we do not need to 
simulate cracks or crack propagation. Instead, we are only interested to compute 
when structural failure occurs. 
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We use the Cockcroft and Latham criterion [19], 
W: = We) = f max (71 (@), 0) da, 6) 
0 


which was used by Tarigopula et al. [93] for analyzing large deformation in DP 
steels. It depends on the maximum principal stress component F; and the equivalent 
plastic strain &;, := @(t,) at load step k (pseudo time ¢;,) and is integrated over @. 
Since g depends on the load step, this also holds for the failure criterion W and the 
stress &. In general, in FE*, & is not known explicitly but can be approximated by 
the volumetric average* @ := (a(t,)) from the microscopic level at load step k. 

The value of W is computed at each Gauß point and is accumulated throughout 
the loading process until at least one Gauß point exceeds a critical value We at 
which failure initializes, i.e., W > We is fulfilled. Tarigopula et al. provide the 
value We = 590 — 610 MPa for DP800 grade of steel; see [93]. Since we consider 
DP600 grade of steel, the critical value should be smaller in our case. 

Equation (3) is approximated by numerical integration and using & 


a k 
We ~ We) = | max E0), 0) dë x X max E&i), 0) @ ~-1) 
0 


i=1 


= We-1 + max (F7 (Č), 0) - (@ — x-1). (4) 


where (œ; — @—1) is nothing else than the increment of the equivalent plastic strain 
from load step i — 1 to load step i and Wọ = @ = 0. Hence, we can sum up the 
failure criterion W over all load steps and summation is performed whenever a load 
step reached convergence. See Fig. 9 for an example of the evolution of the failure 
criterion W during a Nakajima simulation. 

Let us note that the failure criterion is formulated such that ductile failure takes 
place due to tensile stresses and shear stresses, where the effect of tensile stresses is 
considered by usage of the maximum positive principal value of Cauchy stress and 
the effect of shear stresses and work-hardening is considered through the equivalent 
plastic strain increment. 


4.4 Numerical Realization of the Experimental Cross Section 
Method 


In the experiment, major and minor strains £; and €2 are evaluated at the top surface 
of the sample sheets along cross sections. However, the simulation only provides 


4Note that here and in the following, all volumetric averages obtained from microscopic level are 
in brackets ( ). 
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exact macroscopic values in the integration points, which generally do not coincide 
with the finite element nodes. Therefore, the simulations do not provide any exact 
macroscopic values on the sample sheets surface, and we decided to consider cross 
sections along those GauB points closest to the upper surface. 

For the computation of major and minor strains €; and €2 we transform our 
resulting 3D strain tensor to the 2D plane strain tensor and then follow the strategy 
as described in [97]. For further details, we refer to [95]. 

In this article, we show results for computations using the symmetric test setup 
of the experiment; see Sect. 4.2. This automatically implies that we assume that the 
failure zone evolves along the vertical center line and the center of the failure zone 
is identical to the center of the upper surface of the sample. For the computations 
using symmetry, all discretizations have finite element nodes at the center of the 
probe. Keeping in mind that we choose cross sections along Gauß points, no cross 
section cuts through the center of the failure zone. To keep the distance between 
the first cross section and the center of the failure zone as small as possible, we 
consider integration points with the smallest distance to the horizontal center line as 
first cross section. For simplicity, the other cross sections are along the remaining 
Gauß points of the same finite elements which were used for the first cross section. 
Hence, the distance between the cross sections depends on the diameter of those 
finite elements and is smaller than 2 mm in our computations. 

Due to the symmetric computations, we only have one side of the cross sections 
at hand but the other side can be simply generated by mirroring; see Fig. 2 and the 
upper pictures in Fig. 8. 

Note that the cross section method can only be used for specimens with a single 
failure zone. Unfortunately, in our simulations the failure zone does not always 
evolve along the vertical center line but parallel to it for sample sheets with a parallel 
shaft width of 100 mm or more. Hence, mirroring leads to the occurrence of a 
second failure zone; see Figs. 11 (left) and 12. For these specimens, first simulations 
using the upper half or the complete domain of the sample sheet also lead to two 
failure zones parallel to the vertical center line. For simulations obtaining two failure 
zones, we assume that the maximum major strain along the cross section defines the 
position of the failure zone. Neglecting all values between the vertical center line 
and the maximum major strain and shifting the failure zone to the vertical center 
line, we can proceed as before; see Fig. 8. 

So far it is not clear, whether the symmetry assumptions hold for specimens with 
failure zones parallel to the vertical center line. In future work, we have to perform 
a comparison with the full geometry. 
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5 The Virtual Forming Limit Diagram Computed with 
FE2TI 


Since we use the symmetry assumptions for the numerical simulation of an FLC, we 
perform our simulations on a quarter of the full specimen and rebuild the solution by 
mirroring; see also Sect. 4.2. The computational domain is discretized by structured 
Q2 finite elements, where the number of finite elements depends on the width of the 
parallel shaft as well as on the chosen boundary conditions; see Sect. 4.1 as well as 
Table 2. Note that we use (only) two Q2 finite elements in the direction of thickness 
for all specimens. 

For the microscopic problems, we consider an SSRVE with two ellipsoidal 
inclusions discretized by 1470 P2 finite elements and 7152 d.o-f. in an unstructured 
manner. As mentioned before, the resulting tangent problems on the microscopic 
level are solved by using the direct solver package PARDISO [81] and each problem 
is solved on an individual MPI rank. Hence, the number of macroscopic finite 
elements automatically determines the number of required MPI ranks, which is 
27 times the number of finite elements. We have performed all our simulations on 
JUWELS [45] using 2 MPI ranks per core and a penalty parameter of 500. For the 
specifications of the rigid tools, we refer to Sect. 4.1. 

As an initial load increment, we choose 0.1 mm and define Tm” = 0.2 mm 
as maximum allowed load increment. Our stopping criterion is either based on 
reaching a predefined covered distance of the forming tool of 40 mm or on the 
load increment and not on the failure criterion, since we have only little experience 
how to choose the critical value W, to detect failure. Simulation stops if the 
load increment of 10 successive load steps is smaller than a predefined allowed 
minimum, which is the initial load increment multiplied by 1074, or, if the load 
increment has to be reduced 7 times within a single load step. This is motivated 
by the fact that small load increments indicate hard numerical problems which are 
expected in case of material failure. 

We have summarized some data on our Nakajima simulations including the 
number of restarts in Table 2. Note that most restarts are caused by reaching the 
requested wall time and only in few cases, if any, by hardware errors. 

For all specimens with a parallel shaft, we obtain comparable results, which are 
characterized in the following. After a certain covered distance of the tool, a local 
increase in the failure value W, the major strains 1, the equivalent plastic strain &, 
the thinning rate, and the von Mises stress can be detected almost simultaneously 
along an area parallel to the vertical center line. Later, the values continue to rise, 
especially in this area, so that the degree of localization increases; see Fig. 9. Finally, 
some microscopic problems within the aforementioned localized area cause the 
simulation to end. At this point, however, a pronounced change in thickness can be 
observed within the localized area, which can be associated with material failure; 
see cross sections in Figs. 12 and 10 (top right) as well as the upper right picture in 
Fig. 11 (left). 
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Cockroft and Latham Criterion W (in MPa) 
0.0e+00 50 100 150 200 250 300 350 400 450 500 5506.0e+02 


Lood Step 300 


Lood Step 100 Load Step 200 
Covered Dist.: 20.77mm 


Covered Dist.: 9.65ene Covered Dist.: 15,82 


Load Step 400 
Covered Dist.: 23.54mm 


Load Step 600 


Load Step 500 tep 
Covered Dist: 2708mm 


Covered Dist.: 25.82mm 


Fig. 9 Evolution of failure criterion W during the simulation for the specimen with a width of the 
parallel shaft of 50 mm 
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Fig. 10 Final solution of the simulation with a specimen with a width of the parallel shaft of 
70 mm; z-direction (top left), thickness (top right), von Mises stress (bottom left), major strain 
(bottom center), and thinning rate (bottom right). The grey area is not simulated since material is 
totally clamped between die and blank holder 


For the completely circular specimen we do not observe any localized effects 
along an area parallel to the vertical center line, even if we reach the predefined 
distance of 40 mm. Instead, we obtain a similar behavior over nearly the complete 
contact area; see Fig. 11 (right). 

The results for specimens with a parallel shaft can be divided into two groups. 
The first group contains all samples with a parallel shaft width of at most 90 mm. For 
these specimens material failure occurs along the vertical center line; see Fig. 10. 
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Fig. 11 Final solution of the simulation with a specimen with a width of the parallel shaft of 
100 mm (left) and the completely circular specimen (right); variables and color bars as in Fig. 10. 
Left: Material between blank holder and die is simulated since material movement is allowed. 
Right: Material between blank holder and die is assumed to be clamped (dark grey) and hence is 
not simulated 
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Fig. 12 Final results of distribution of Cockcroft and Latham failure value W for all Nakajima 
simulations and associated Forming Limit Diagram (FLD) with FLC (black curve) for We = 450. 
In the cross section, one can identify local necking in thickness for all but the completely circular 
specimen 


All specimens with a wider parallel shaft belong to the second group, which can 
be characterized by material failure parallel to the vertical center line. For these 
samples, mirroring of the computed solution leads to the occurrence of a second 
failure zone; see Figs. 11 (left) and 12. As mentioned before, first numerical tests 
indicate that this also holds when simulating the complete specimen or the upper 
half of it. Hence, we decided to use the results to determine an FLC. The adaptions 
to evaluate major and minor strains for failure zones parallel to the vertical center 
line are mentioned in Sect. 4.4; see also Fig. 8. Without manipulating the values 
along the cross sections, we would obtain unphysically high values in the FLD. 
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Based on all available results we have subsequently defined a critical value of 
We = 450 for the failure criterion. Hence, we have to find for all simulations the 
corresponding load step for which the failure value in at least one point exceeds the 
critical value W, = 450 for the first time; see Table 3. 

When we have found the corresponding load steps, we have to compute the major 
and minor strain values along the cross sections perpendicular to the failure zone 
and generate for each cross section one point in the FLD. So, we come up with 
30 different pairs of major and minor strains wherein each three belong to one 
specimen; see Fig. 12. The final FLC is derived by regression; see black curve in 
Fig. 12 and [95] for details. 


6 Linear and Nonlinear Parallel Solvers 


For large scale computations using FE2TI, scalable parallel implicit solver algo- 
rithms are needed for the problems on the microscale as well as the macroscale [64]. 
Another focus of the EXASTEEL project therefore was on solver algorithms, 
i.e., efficient and highly parallel scalable implicit solvers for linear and nonlinear 
problems arising from a finite element discretization of linear and nonlinear partial 
differential equations; see, e.g., [5, 54-57, 59-61, 63]. 

Here, nonlinear domain decomposition (DD) approaches, namely nonlinear 
FETI-DP (Finite Element Tearing and Interconnecting-Dual Primal) and nonlinear 
BDDC (Balancing Domain Decomposition by Constraints) methods, have been 
introduced in the first phase of EXASTEEL and improved during the second phase, 
where also new variants were introduced. 

In [64], our new nonlinear FETI-DP solver algorithms were then applied within 
large FE2TI simulations for the first time: We have used Nonlinear-FETI-DP-1 as 
a parallel implicit solver for the microscopic problems using 114,688 KNL cores 
of the Theta many-core supercomputer (Argonne National Laboratory) [64, section 
4.9]. However, the nonlinear DD methods developed in the project have a broad 
range of applications, e.g., in hyperelasticity, elasto-plasticity, or nonlinear diffusion 
problems. 

To further improve the performance of the nonlinear solvers, also the efficiency 
and parallel scalability of all building blocks was in the focus of the EXASTEEL 
project [5, 54-56, 60, 61]. 

In this section, we describe very briefly our nonlinear domain decomposition 
approaches and sum up the achievements and highlights obtained within the past 6 
years. 
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6.1 Nonlinear FETI-DP Framework 


Classical domain decomposition methods are robust and highly scalable divide-and- 
conquer algorithms for the iterative solution of discretized linear partial differential 
equations. In the case of FETI-DP methods [27, 28, 47—50], the computational 
domain is decomposed into nonoverlapping subdomains which are distributed to 
the available compute cores. FETI-DP methods are well established in structural 
mechanics and have been awarded a Gordon Bell price [10]. 

The robustness and scalability originates from the sparse direct solvers applied 
on the subdomains, combined with a carefully designed coarse problem. The coarse 
problem, though necessary for robustness, is a potential scaling bottleneck, since 
its size grows with the number of subdomains, i.e., with the number of parallel 
cores. In order to retain scalability, the coarse solution can be approximated, e.g., 
by algebraic multigrid methods; see [46, 48]. Finally, to solve nonlinear problems, 
a linearization with Newton’s method is usually applied first, and the linearized 
tangential systems are then solved, e.g., by FETI-DP. We refer to the latter approach 
as Newton-Krylov-FETI-DP. 

In contrast, in nonlinear FETI-DP or BDDC methods, first introduced in [51], 
the discretized nonlinear partial differential equation is decomposed into small and 
independent nonlinear problems before linearization. In the case of nonlinear FETI- 
DP, a nonlinear coarse problem is added by strongly coupling the local nonlinear 
problems in few variables on the interface of the domain decomposition, as, e.g., 
vertices or averages over edges or faces. This leads to a nonlinear FETI-DP saddle 
point system; see, e.g., [51, eq. (3.2)], [54, eq. (3.4)], or [59, eq. (1)]. Also, a 
partially nonlinear elimination process of sets of variables from the nonlinear FETI- 
DP saddle point system is possible before linearization. The nonlinear elimination 
process can be interpreted as a nonlinear right-preconditioner, which we described 
in [59, Section 2.5] in detail. There we also introduced a unified framework 
to describe different nonlinear FETI-DP and BDDC methods. The selection of 
the elimination set finally defines the nonlinear FETI-DP method precisely. We 
discussed four canonical choices in [59], but other choices are feasible and possibly 
beneficial. Let us briefly repeat the four variants from [59]. In FETI-DP, all degrees 
of freedom or variables are divided into three classes. First, all variables belonging 
to the interior of the subdomains are grouped into the set J (marked as circles 
in Fig. 13), second, all variables of the global coarse problem are grouped into 
the set II of so-called primal variables (marked as squares in Fig. 13), and third, 
all local degrees of freedom on the interface are grouped into the set A of so- 
called dual variables (marked as dots in Fig. 13). Let us remark that continuity of 
the solution in the dual degrees of freedom is enforced iteratively by enforcing 
zero jump constraints with a Lagrangian approach; see [27] for details. Finally, 
Nonlinear-FETI-DP-1 is defined by eliminating nothing, Nonlinear-FETI-DP-2 by 
eliminating everything, Nonlinear-FETI-DP-3 by eliminating the interior and dual 
variables, and finally Nonlinear-FETI-DP-4 by eliminating the interior variables; 
see Fig. 13 for a visualization of the different variants. 
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Fig. 13 The coarse problem is marked with squares, the interior degrees of freedom by circles, and 
the dual degrees of freedom by dots. The colored degrees of freedom are eliminated nonlinearly 
before linearization in the different variants. From left to right: Nonlinear-FETI-DP-1, Nonlinear- 
FETI-DP-4, Nonlinear-FETI-DP-3, Nonlinear-FETI-DP-2 
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Fig. 14 Comparison of classical Newton-Krylov-FETI-DP (NK), Nonlinear-FETI-DP-1 (NL1), 
Nonlinear-FETI-DP-2 (NL2), Nonlinear-FETI-DP-3 (NL3), and Nonlinear-FETI-DP-4 (NL4) 
using the JUQUEEN supercomputer. Left: Nonlinear p-Laplace inclusions not touching the 
interface; 40k d.o.f. per subdomain; Right: Nonlinear p-Laplace channels crossing the interface; 
160k d.o.f. per subdomain. For a detailed discussion of the results see also [59, Fig. 10 and 12] and 
corresponding descriptions. If an appropriate nonlinear elimination set is chosen for the problem 
(as is the case here in NL3) then the nonlinear method outperforms the classical Newton-Krylov 
approach significantly 


If the elimination set is chosen appropriately, nonlinear FETI-DP methods can 
outperform classical methods, i.e., Newton-Krylov-FETI-DP. In [42], a heuristic 
approach is suggested to eliminate the areas with strong nonlinear effects. For 
illustration let us consider combinations of the nonlinear p-Laplace equation and 
the linear Laplace equation, where the nonlinearities either touch the subdomain 
interface or are at a distance. In the latter case, we choose nonlinear inclusions 
of p-Laplace enclosed in the subdomains and in the first case nonlinear channels 
of p-Laplace cutting certain edges. For a detailed description of the chosen 
model problem, see [59, Section 5.1, Fig. 7]. Considering the inclusions, all 
nonlinear FETI-DP methods work well. Considering the example with channels, 
it is necessary to at least eliminate the dual interface variables; see Fig. 14 or [59]. 
Let us note that Nonlinear-FETI-DP-1 performs well in both cases, which is a result 
of a careful choice of the initial value; see [51, Section 3.3] for details. 
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E Nonlinear-FETI-DP-1 — Ideal Scaling 


Time to Solution in s 


16 64 256 1K 4K 16K 65K 262K 786K 
#Cores 


Fig. 15 Weak scalability of Nonlinear-FETI-DP-1 for a p-Laplace problem computed on the Mira 
supercomputer; largest problem has 62.9 billion d.o.f.; results from [57]; see also [57, Fig. 4 (left)] 


Another benefit of nonlinear domain decomposition approaches is the localiza- 
tion of work, which increases the scalability of these methods. This can be verified 
either in Fig. 14 or in our larger weak scaling experiments on Mira published 
in [57] and presented in Fig. 15, where an algebraic multigrid preconditioner from 
the BoomerAMG package [38] is used to approximate the coarse solve to obtain 
scalability without losing robustness. 

We have also considered approaches to improve the convergence of nonlinear 
FETI-DP methods. We have introduced heuristics to determine if a nonlinear 
elimination is useful in a certain Newton step or not. Additionally, the elimination 
process is approximated up to a necessary tolerance to save computational cost. 
This approach is called NL-ane (approximate nonlinear elimination) and is also 
discussed in [59, 62]. We recently also considered a globalization strategy using 
the SQP approach; see Sect. 6.1.2. 

Finally, we successfully investigated a hybrid parallelization of FETI-DP using 
PARDISO in [55], and also considered nonlinear FETI-DP and BDDC methods 
where the sparse direct solvers are replaced by preconditioners; see [56, 60, 61]. 


6.1.1 Improving Energy Efficiency 
In classical Newton-Krylov methods, global synchronization occurs in every Krylov 


iteration. Global synchronization can be significantly more coarse-grained in our 
nonlinear domain decomposition methods since the Krylov iteration can be asyn- 
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Fig. 16 Power per core for Newton-Krylov-FETI-DP (NK) and Nonlinear-FETI-DP-3 (NL3) with 
normal barrier (b) and test-sleep approach (b-ts) on Meggie cluster. Each subdomain problem 
has 160k d.o.f. Left: Nonlinearity in each of the subdomains; Right: Nonlinearity in a single 
subdomain. See also [63, Fig. 10] and corresponding descriptions for details 


chronous between subdomains. In this section, we describe how this can be 
exploited to save energy when load imbalances occur. 

If the nonlinear elimination set in nonlinear FETI-DP is completely local to the 
subdomains as, e.g., in Nonlinear-FETI-DP-3, the nonlinear subdomain problems 
can be solved in parallel and asynchronously. This solution process using Newton’s 
method can introduce a load imbalance, even for perfectly balanced meshes, if the 
number of Newton iterations is different between subdomains; see [63, Fig. 7]. 
Note that even for these cases, Nonlinear-FETI-DP-3 typically has a shorter time 
to solution compared with classical approaches; see Fig. 14. 

In [63], we have suggested to use a nonblocking barrier in combination with 
a sleep statement to set idling cores in deep sleep states, to reduce energy 
consumption. This is feasible in nonlinear parallel domain decomposition since 
the synchronization between the cores is coarse grained (typically larger than | s). 
During these phases sleeping cores wake up every 10 ms. The wake-up latency itself 
for current Intel processors is significantly below 1 ms. Therefore, the overhead of 
the sleep and wake up approach is insignificant compared to the time spawn of 
global synchronization and does not impact overall performance or scalability. We 
call this method test-sleep approach. To investigate the energy saving potential, we 
measured the energy consumption of Nonlinear-FETI-DP-3 in [63] reading out the 
RAPL hardware counters with LIKWID [94] on the Meggie? cluster at Erlangen 
Regional Computing Center (RRZE). In Fig. 16, we present the power consumption 
per core for two different scenarios, i.e., a single nonlinear inclusion in a single 
subdomain or a single nonlinear inclusion in each subdomain. The total energy 


Standard Cluster with Intel Omnipath Interconnect and two Intel Xeon E5-2630 v4 chips per 
node. 
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Fig. 17 Total energy to solution for Newton-Krylov-FETI-DP (NK) and Nonlinear-FETI-DP-3 
(NL3) with normal barrier (b) and test-sleep approach (b-ts). Computation on 5120 Meggie cores. 
See also [63, Fig. 9] for a complete weak scaling study 


consumed by the nodes during the solution procedure on 5120 cores is shown 
in Fig. 17. The test-sleep approach also works for alternative nonlinear domain 
decomposition methods, as, e.g., ASPIN [17]; see [63] for a brief discussion using 
the ASPIN implementation described in [15] which is provided in PETSc. 


6.1.2 Globalization 


We consider different globalization strategies for our nonlinear domain decompo- 
sition methods. For the different nonlinear FETI-DP methods, we consider trust 
region methods and also an approach based on the SQP (sequential quadratic 
programming) method using the exact penalty function Ps (ü) = J(u) + B\|Bal|y, 
where J denotes the mechanical energy and Bu are the FETI-DP equality con- 
straints; see Table 4 for some results. 


Table 4 Number of global iterations for a snap through buckling problem for compressible Neo- 
Hookean energy with material parameters E = 210, v = 0.3 in 2D; 100 subdomains, 20,402 
degrees of freedom; — : failed 


Volume force 
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Fig. 18 Left: Weak scaling experiment on the JUQUEEN supercomputer; deformation of a 
hyperelastic cube with 0.7 billion d.o.f. for the largest computation (heterogeneous Neo-Hooke 
material model); using Newton-Krylov-BDDC or Nonlinear-BDDC; see [60, Section 4.4, Fig. 15] 
for details. For 262,144 subdomains Newton-Krylov-BDDC did not converge within 20 Newton 
iterations. Right: Weak scaling of linear BDDC solver with approximate coarse solve (using AMG) 
on JUQUEEN (BG/Q) and Theta (KNL) supercomputers for a heterogeneous linear elastic model 
problem in two dimensions with 14k d.o.f. per subdomain; see [60, Fig. 7] for details 


6.2 Nonlinear BDDC Methods 


Using the same elimination set as in Nonlinear-FETI-DP-4, the nonlinear BDDC 
method [51] can be derived, which is based on its linear version; see [20, 24, 71- 
73]. We presented an efficient and scalable implementation of linear and nonlinear 
BDDC avoiding the computation of Schur-complements in [60]. This approach 
proved to be faster, more scalable, and more robust for nonlinear hyperelasticity 
problems (see Fig. 18 (left)) as well as for elasto-plasticity problems using realistic 
RVE microstructures obtained from EBSD measurements; see [60, Table 4.7]. 
We also studied the scalability of the embedded linear BDDC solver on different 
architectures; see Fig. 18 (right). 


7 Multicore Performance Engineering of Sparse Triangular 
Solves in PARDISO Using the Roofline Performance Model 


The PARDISO [12, 23, 69, 96] parallel sparse direct solver is a building block in 
FE2TI. As long as the macroscopic problem is small enough, it can be solved 
directly by PARDISO; if the microscopic problems are of reasonable size, it is 
efficient to use PARDISO concurrently on the microscale problems. For large micro 
and macro problems, the direct solver has to be replaced by linear or nonlinear FETI- 
DP or BDDC domain decomposition solvers. Here, PARDISO is typically applied 
as the subdomain and coarse solver. 
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The PARDISO solver has two phases: factorization and forward/backward 
substitution, with factorization being more time consuming than substitution. 
However, the former is only performed once in a FETI-DP or BDDC domain 
decomposition iterative process, whereas the latter is repeated many times. We 
are in particular interested in the forward and backward solution process of sparse 
direct solvers since they build the computational kernel, e.g., in FETI-DP or BDDC 
methods. FETI-DP and BDDC are known to be highly parallelizable, but most 
implementations are using sparse direct solvers as building blocks. More precisely, 
most domain decomposition implementations use sparse direct solves for the local 
subdomain problems to obtain the necessary robustness. Additionally, the coarse 
problem is usually solved directly up to a certain size, but for larger problems the 
coarse solve is often approximated by, e.g., AMG or recursive applications of the 
domain decomposition approach itself. Therefore, we investigate and analyze the 
performance of the forward/backward solution process of PARDISO for the local 
subdomain solves in FETI-DP and BDDC and present not only numerical results, 
but also a detailed performance analysis for a representative sparse solver kernel 
based on the roofline model. The goal is to create an analysis of this part of the 
algorithm and to establish a roofline performance model [98], which considers 
performance bounds given by the memory bandwidth and the processor peak 
performance. We modeled both the serial and parallel execution phases. Despite 
the fact that the roofline model prediction can be inaccurate in the serial case, 
when the in-core execution or in-cache transfers become dominant, it still provides 
an easily obtainable upper bound. The simple roofline model brings together the 
computational capabilities of the processor and the available memory bandwidth 
with the requirements of an algorithm. In our case the relevant quantities are 
the number of FLOPs performed and the data transferred between processor and 
memory, which we determined by an analysis of the forward/backward substitution. 

The performance of the forward and backward substitution process is analyzed 
and benchmarked for a representative set of sparse matrices on modern x86-type 
multicore architectures. The characteristic quantities of these systems are shown in 
Table 5 along with the required machine specific input parameters (lower part of 
Table 5) for the roofline model. The measurement approach, its validation, as well 
as limitations are discussed in [98]. Our modeling approach covered both the serial 
and parallel execution phases, allowing for in-socket performance predictions. The 
performance modeling results for a discretized Laplacian matrix (‘lap]2’) and a local 
subdomain matrix (‘BDDC’) from the EXASTEEL project are shown in Fig. 19; see 
also Table 6 for dimensions and numbers of nonzeros for the considered matrices. 
The latter matrix is used in FETI-DP as well as BDDC methods to compute 
the discrete harmonic extension from the domain decomposition interface to the 
interior of a certain subdomain. The specific problem represents a typical RVE 
using the Jz elasto-plasticity model including the material parameters also used in 
the computations of the FLD. We verified that the considered subdomain already 
showed a plastic behavior. 

As opposed to the original roofline model, the modified roofline model covers 
also the performance impact of limited scalability imposed by the algorithm, i.e., 
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Table 5 Details of the Intel and AMD hardware systems evaluated 


Name SKX 

Processor name Intel Xeon 
Gold 6148 

Micro architecture Skylake 

Frequency [GHz] 2.4 

Number of cores aoo o o ooo 20 

Vector instruction set T AVX-512 


NUMA LDs 
Scalar read bandwidth 

1 core [GB/s] 14.5 19.3 
NUMA LD [GB/s] 108.0 37.6 
Scalar ADD+MUL/FMA 

T core [Fey 2o p p p 
NUMA LD [F/cy] ee 
Scalar machine balance By, 

1 core [B/F] 1.5 2.1 
NUMA LD [B/F] o fos jos |o 


The NUMA Locality Domain (LD) refers to a number of cores sharing a physical or logical 
memory controller 


Fig. 19 Performance on V 15 TPARDISO 
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Table 6 Dimension (n) and Matrix |n anz(A) nnz(L) 

number of nonzeros (nnz) for 3 7 z 
Aand L for benchmark lapl2 343 x 10 1x10 166 x 10 
matrices BDDC |750 x 10° |31 x 106 | 1590 x 10° 


both serial and parallel execution phases of the forward and backward substitution 
are considered in the model; see [98] for details. It captures the behavior of the 
measured performance quite well compared to the original roofline model. 

For Intel Ivy Bridge systems the modified roofline model error is only up to 
5%. Further details are given in [98]. During the second phase of EXASTEEL 
the close collaboration with ESSEX-II in the context of performance engineering 
was also extended to iterative solvers, leading to a new promising recursive 
algebraic coloring scheme [1]. The benefits of the recursive algebraic coloring 
were demonstrated by applying it to the kernel operation of a symmetric sparse 
matrix-vector multiplication (SymmSpMV) on various multicore architectures. 
The performance was compared against standard multicoloring and various other 
algebraic block multicoloring methods. The coloring implementation resulted in an 
average and maximum speedup of 1.4 and 2, respectively. Our entire experimental 
and performance analysis process was also backed by the roofline performance 
model, corroborating the optimality of the approach in terms of resource utilization 
of the SymmSpMV kernel on modern hardware; see the ESSEX-II report in this 
book for details. 


8 Improvement of the Mechanical Model for Forming 
Simulations 


In this section, we describe improvements to the modeling developed in the project. 
Not all of the techniques described here are currently used in our FE2TI production 
simulations, mainly to reduce computational cost. 

As mentioned earlier, the favorable macroscopic properties are to a large extent 
due to the heterogeneous microstructural features of the DP steels. A sophisti- 
cated heat treatment process is used to produce a heterogeneous microstructure 
with ferritic matrix and martensitic inclusions. This process is also accompanied 
by several effects which then in conjunction with the difference in mechanical 
properties of ferrite (soft phase) and martensite (hard phase) generate advantageous 
macroscopic properties. In this project, an initial volumetric strains approach, c.f. 
[14], was developed to mimic the micromechanical features resulting from the phase 
transformations during the production process. 

The high yield and work-hardening properties of DP steels, on the other hand, 
pose a problem in terms of forming complex geometries and designing the metal 
working tools. One of the major issues associated with the forming of DP steels, is 
the large springback observed at the end of the forming process, which results in 
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undesirable geometries of the formed parts. Here, simulating the forming process 
with an accurate material behavior can help to predict springback precisely and 
further save valuable resources while optimizing the tooling parameters for the 
process. The springback behavior is found to be very closely associated to the 
Bauschinger factor of the material. Therefore, within this project a multiscale 
modeling strategy to effectively model the DP steel response under cyclic loading 
was developed. In this context, an efficient neural network based algorithm is 
employed to identify the associated microstructural material parameters, leading 
to a reduction in the required computational resources. In order to obtain further 
understanding of the correlation of the model parameters on the macroscopic 
behavior of DP steels during cyclic loading, uncertainty quantification studies have 
been carried out using the developed mechanical models. 

Due to their higher accuracy and physical interpretability, crystal plasticity 
formulations may be used at the RVE level to directly describe plasticity in a 
polycrystal such as multiphase steels. Due to the fact that such formulations are 
computationally highly expensive, they may be primarily applied to computationally 
identify macroscopic yield surfaces. FE? simulations of metal forming processes 
based on such formulations at the small scale will however hardly be feasible. There- 
fore, as part of this project one goal was to improve the quality of micromechanical 
models to be used efficiently in the context of FE”. The associated micromechanical 
simulations are mainly making use of a classical finite J2 elasto-plasticity material 
model, c.f. [79, 87, 89, 90], which is used to model the micro-constituents (ferrite 
and martensite) by defining the hardening law 


BP? = yoo + (MO? — Yoo exp na) + har (5) 
Herein, ye is the initial yield stress, ye is the saturation yield stress, ni% is the 
exponent, /'°° is the linear hardening at saturation yield stress and a is the equivalent 
plastic strain variable. The material parameters of the models are calibrated based on 
uniaxial tests performed on the pure individual constituents martensite and ferrite. 
As representative microstructure a so-called statistically similar RVE was identified; 
see [14]. More information on SSRVEs can be obtained in [7, 8], and [83]. Although 
the individual phases can be represented accurately and the microstructure is 
reflected with high accuracy, still the experimental stress-strain response cannot be 
obtained. The main reason is that distributed properties in the ferritic matrix phase, 
which however result from the production process, were not yet taken into account. 
In addition to that, when focusing on cyclic loading protocols, the macroscopic 
kinematic hardening of the real sheet metal cannot be represented. As a suitable 
quantitative measure for the kinematic hardening, the so-called Bauschinger factor 
can be computed as in = (Pil — Py)/|Pil. Herein, P| = P, (Al, /Lx.0 = —0.05) 
and Py = P (Alk [oP = 0) + 0.002), where P is the 1st Piola-Kirchoff stress 
tensor. Although the Bauschinger factor of the FE? simulation with PS = 0.47 is 
interestingly high considering that for the individual phases no kinematic hardening 
is taken into account up to here, it does not agree well with the experimental value 
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FF = 0.66. Therefore, in this project three major improvements were developed to 
enhance the model quality for FE? simulations of sheet metal forming problems: (i) 
A mixed isotropic-kinematic hardening model was implemented at the microscale 
for the ferrite phase, (ii) an initial volumetric strains approach was developed to 
model the locally distributed plastic properties in the ferrite phase, and (iii) an 
implicit fit procedure was constructed based on a neural network to identify the 
kinematic hardening parameters. 

A mixed hardening model was implemented for the ferritic phase, which consists 
of an exponential isotropic hardening law, see Eq. (5) and a linear kinematic 
hardening law, c.f. [89]. The yield criterion and the evolution of the back stress & 
are then given by 


devo —& 


——. 6 
||devo — §]| i 


: Dw. ‘ 2. ; 
™* = |\devo — &|| — [2e and £= wa 


Here, H is an additional material parameter, signifying the linear kinematic 
hardening. Thus, the material parameters associated with the ferritic phase should be 
newly identified for the mixed hardening material model. An appropriate multiscale 
approach has been developed which is described in Sect. 8.2. 


8.1 Initial Volumetric Strains (IVS) Approach 


The IVS approach proposed in [14] allows the modeling of heterogeneous yield 
stress distribution in ferrite and results in a good agreement of the predicted stress 
values with the experiments. Here, the ferritic yield curve is locally modified using 
modification factor y (X) € [1, 1.6], quantified based on physical and experimental 
observations. As a result of this continuous procedure, where the microstructure is 
subjected to first IVS, followed by subsequent mechanical loading, for, e.g., uniaxial 
tension, not only the distributed properties are obtained but also the eigenstresses 
related with the volume expansion of the inclusion phase can be modeled. However, 
in the context of FE? simulations this procedure is rather expensive since the 
application of the volumetric strains has to be simulated at each point before the 
actual loading can be applied. Due to the fact that the above-mentioned eigenstresses 
are not significant to the macroscopic stresses under loading, here a separated 
approach is proposed: the first step of applying IVS is considered only to generate 
the local ferritic yield modification factors which are saved independently of any 
potential subsequent loading. Then, in the second step of mechanical loading these 
modification factors are applied to the undeformed microstructure. The main benefit 
is the reduction of computing time since the IVS has to be performed only once to 
one microstructure. On the other hand, the eigenstresses obtained from the volume 
expansion are not included anymore. Note that these eigenstresses are usually 
removed from the DP steel sheet by a special heat treatment procedure which is why 
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Fig. 20 (a) Illustrations of 
the steps involved in the 
(modified) initial volumetric 
strains approach, (b) 
comparison of macroscopic 
stress-strain curves for FE? 
uniaxial tension calculations 
for simplified microstructure 
with spherical inclusion: (1) 
modified IVS approach, (ii) 
IVS approach [14], (iii) no 
IVS approach and (iv) 
experiment 
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the absence of these eigenstresses in the numerical simulation may even be more 
realistic. The scheme is illustrated in Fig. 20a. Furthermore, the macroscopic stress 
strain curves obtained under uniaxial tension for various IVS considerations are 
compared against the experiment in Fig. 20b. Here, it can be seen that the proposed 
modified (separated) IVS scheme performs equivalently to the continuous IVS as 
given in [14]. However, as seen in Fig. 20b, not using the IVS approach yields a 
poor accuracy in representing the experimental curve. 

Additionally, it is observed that the choice and extent of ferritic hardening has no 
effect on the resulting factors y and that the same set can be applied in mechanical 
loading computations as long as (i) the microstructure, (ii) the amount of martensitic 
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volume jump considered (i.e. 4%) and (3) the initial yield stress of ferrite QE > 
remain unchanged. 

The IVS approach has been implemented in FE2TI but has not been applied in 
the production runs in Sect. 5. 


8.2 Parameter Identification Approach for Ferritic Mixed 
Hardening 


The incorporation of the mixed hardening in ferrite along with the initial volumetric 
strains approach necessitates the identification of a new set of material parameters 
for ferrite, i.e. y=, 78° and H*!®, Here, ye and h'S° are assumed to be constant. 
Since no cyclic stress-strain data is available for the pure ferrite, the ferrite properties 
need to be adjusted such that the macroscopic response matches well the experiment. 
Due to the micro-macro nature of the computations required for the resulting 
inverse problem, this parameter identification problem becomes highly time and 
computation intensive. Therefore, to accelerate the process a neural network based 
algorithm is proposed. 

As illustrated in Fig. 21, a sufficiently trained neural network takes in eight input 
values from DP steel experiments and outputs the values for the three parameters to 
be identified. These input values are as illustrated in Fig. 21b, i.e. seven macroscopic 
stress values and the Bauschinger factor. The neural network consists of one hidden 


Hidden Output 


-0.05 -0.04 -0.03 -0.02 -0.01 0 


ae 


Fig. 21 (a) Schematic representation of the neural network with the respective activation functions 
at various layers, (b) the input values for the neural network—7 macroscopic stress-strain values 
and | macroscopic Bauschinger factor 
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Table 7 Neural network | grin iso iso 
Da | n Yoo 
training data range for the i [0921 GPa 150250 |0.35-0.80 GP 
identification of ferritic Range 9-2. a Ss : : £ 
Nos. | 5 |4 4 


material parameters 


Table 8 (a) Target material parameters for the ferritic phase identified with the trained neural 
network and (b) Bauschinger factor computed with mixed hardening in ferrite (Sim-mix) 


(a) Eg qi ye (b) [Exp  [Sim-mix 
[1.71 GPa [25.6 [0.375 GPa | fe (0.66  |0.62 


0 
T d +. + 
a 02 S i 
U -0.4 0.4 l , : 
| a -0.6 -0.6 fi 7 
0.8 -0.8 
-0.03 -0.02 -0.01 0 -0.03 -0.02 -0.01 0 
Alx/1x.9 Alx/1x.0 


Fig. 22 Stress response comparison between simulation model with mixed hardening for ferrite 
and the experiment 


layer with tanh type activation functions and one output layer with linear activation 
functions. Results from 80 simulations with the choice of target parameters in the 
ranges mentioned in Table 7 using the simplified microstructure (spherical inclusion 
in a cuboidal matrix) are used as training data. These simulations with different 
target parameter combinations can all be executed at once on many core machines 
to accelerate the process of gathering training data. 

Additionally, a good choice of training range helps to ensure a robust prediction 
of the target parameters. The parameters identified by evaluating this algorithm 
are given in Table 8a. The macroscopic stresses achieved during compression and 
the overall Bauschinger factor obtained with these parameters are in Fig. 22 and 
Table 8b, respectively. These indicate a good match with the experimental obser- 
vations. Additionally, it is found that the identified material parameters predict a 
higher pure ferritic yield curve than observed in experiments on the lab-synthesized 
pure ferrite. It is emphasized that it is in principle challenging to synthesize a pure 
ferrite which corresponds to the ferrite in the DP steel with respect to similarity in 
chemical composition and grain size distribution. Therefore, the experimental data 
regarding the pure ferritic behavior should generally be only considered carefully. 


8.3 Quantification of Uncertain Material Properties 


The material parameters for the micro-constituents of the DP steel are usually 
obtained based on experiments on limited numbers of samples. Since the material 
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behavior of the constituents depends on the production process parameters, which 
may be non-uniform due to the nature of the process over large batch sizes, the 
measurements might not accurately represent the complete reality. This holds in 
particular for specialized laboratory productions of samples only consisting of the 
pure ferrite phase, which matches the microstructure and chemical composition as 
closely as possible compared to the ferrite in the DP steel. Due to the fact that 
these ferrite properties are however believed to strongly influence the properties 
of the overall DP steel behavior an associated uncertainty quantification analysis 
was conducted as part of the project. Based on such analysis the sensitivity of the 
macroscopic stress-strain response of DP steel for modified ferritic properties can 
be investigated. For the analysis employed here, known probability distributions are 


assumed for selected ferritic parameters which are (i) ge and yee and (ii) H*™. It 


should be noted that varying yio and ye together for ferrite leads to a change in the 
height of the ferritic yield curve. For each of the cases, 15,000 samples are randomly 
constructed to generate Gaussian distributions as input uncertainty distributions for 
the ferrite parameters ye and HY"; see Fig. 23. 

Based on these assumed input distributions of the ferrite parameters the resulting 
distributions of macroscopic properties are to be computed. Trained neural networks 
are used here again to evaluate each of these samples and to compute the macro- 
scopic DP steel responses for (i) the yield stress in compression (R p0.25), (ii) the 
Bauschinger factor (fp) and (iii) the hardening modulus around 5% compression 
(H end). 

The output uncertainties in the above mentioned macroscopic measures are 
then plotted in the sense of their co-relation with the ferritic yield curve and the 
prescribed kinematic hardening parameter in Figs.24 and 25 respectively. The 
correlation between the output macroscopic initial yield stress (Rp0.25) and the 
ferritic yield curve as seen in Fig. 24a for the prescribed input, turns out to be a linear 
relationship. As evident in Fig.24b the macroscopic Bauschinger factor, changes 
non-linearly with the ferritic yield curve. However, the overall small variations in 
the values of the Bauschinger factor suggest that the height of the ferritic yield curve 
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Fig. 23 Input uncertainty distributions for (a) variation in yise and (b) H ferritic material 
parameters 


EXASTEEL: Virtual Laboratory 391 


R yo.25 [GPa] B H ena [GPa] 
0.628 


T 
+ simulated sample + simulated sample 
fit result 


fit result 0.627 


0.626 


0.625 


0.624 + simulated sample 


fit result 


-0.45 
0.32 0.34 0.36 0.38 0.4 0.42 0.32 0.34 0.36 0.38 0.4 0.42 0.32 0.34 0.36 0.38 0.4 0.42 
yë? [GPa] (a) yë? [GPa] (b) yo? [GPa] (c) 


Fig. 24 (a) Output uncertainty of (a) Rpo.25 stress, (b) tp and (c) Hena based on the variation of 
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Fig. 25 (a) Output uncertainty of (a) Rpo.25 stress, (b) fa and (c) Heng based on the variation of 
H 


only negligibly influences the macroscopic Bauschinger factor. Thus, it appears that 
it is the overall large difference in yield stress between the ferrite and the martensite 
rather than moderate changes within the ferrite itself, which is responsible for 
the relatively large kinematic hardening of DP steel. Fig. 24c indicates a linear 
relationship of Hena with the ferritic yield curve height. Again, the values of 
macroscopic moduli change only insignificantly, which indicates a small sensitivity 
of macroscopic response due to modifications of the ferrite yield stress. 

Now, the influence of the kinematic hardening is investigated. The variation in 
the above mentioned macroscopic quantities has been considered for a prescribed 
uncertainty in the linear kinematic hardening modulus H*™ of the pure ferrite phase. 
The results for the macroscopic initial yield stress Rpo.25 are plotted in Fig. 25a, 
where a linearly reducing correlation is observed with increasing ferritic H*". As 
before, the values indicate a negligible change in Rpo.25 with modifications of H., 
This changes significantly for the macroscopic Bauschinger factor, see the results in 
Fig. 25b, which indicate a strong influence of the macroscopic Bauschinger factor 
due to modifications of the ferritic kinematic hardening modulus. The relationship 
is a linearly increasing one. Likewise, an also rather significant, linearly increasing 
correlation is observed between Heng and H*", which has been plotted in Fig. 25c. 
The results indicate that while the variation in the height of the ferritic yield curve 
results in a considerable effect on the macroscopic initial yield stress and hardening 
modulus, it only negligibly influences the Bauschinger factor. Whereas, the ferritic 
kinematic hardening has a strong influence on the macroscopic response of the DP 
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steel, especially the macroscopic Bauschinger factor. This further highlights the 
necessity of employing a mixed hardening based ferritic material model for the 
micro-macro simulation of relevant DP steel forming problems where effects such 
as spring-back are of major importance. 

Especially for uncertainty quantification problems, where the variation in the 
microstructure’s morphology is considered as source of uncertainty, a high number 
of different statistically similar volume elements (SSVEs) needs to be simulated. 
For this purpose, an optimal decomposition approach in the context of a finite cell 
integration scheme was developed in this project, see [26]. This approach allows for 
an automated calculation without the necessity to construct a new mesh for each 
SSVE while keeping the overall computing time even lower than for a conforming 
(standard) mesh. 


8.4 Crystal Plasticity 


A better description of certain phenomena, e.g., localization, in crystalline materials 
can be achieved by explicitly modeling the polycrystalline structure of the material. 

Such materials consist of various single crystals with different orientations 
which interact through the granular interfaces. By directly modeling the plastic 
behavior of these single crystals, anisotropic yield and complex flow behavior can 
be captured directly. As pointed out in Sect. 8, this would lead to computationally 
highly expensive simulations, which can be overcome using approximations of the 
response of the underlying polycrystal. However, to illustrate the procedure and 
complexity of incorporating polycrystalline microstructures directly into multiscale 
simulations, a single crystal plasticity model for face-centered cubic (fcc) crystals at 
small strains has been implemented, considering an additive decomposition of the 
small strain tensor into elastic and plastic part e = e° + eP where é? = >>, pê P? 
directly connects the inelastic behavior in the individual grains to the inherent 
crystallographic structure through the dependency of the rate of plastic strain on the 
projected rate of plastic slip 7° summed over all systems 5. Therein, the projection 
tensor Pê = sym (sê Q n°) is defined based on the orthonormal vectors sê _L nô, 
describing the slip system ô. Single crystal plasticity models can be distinguished 
into rate-independent and rate-dependent models. Algorithms of the former type 
are typically governed by the issue of non-uniqueness of choice of active slip 
systems among all possible ones, [3, 16], which adds to the complexity of the 
material model. Different approaches exist to handle this intrinsic problem by e.g. 
simple perturbation techniques [74], augmented Lagrangian methods [85] or penalty 
approaches. Recently, an alternative approach to handle the issue of non-uniqueness 
among the activity of slip systems has been proposed in [84], which uses Infeasible 
Primal-Dual Interior Point methods for solving the constraint optimization problem. 
This method uses barrier functions combined with the given constraints of the 
problem in order to penalize the approach of the boundary of the feasible domain. In 
contrast to that, rate-dependent algorithms consider all slip systems to be active and 
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link the rate of slip 7° on each system 6 directly to the Schmid stress tê = ø : P®. 
The kinetic law reads 


7 p-l a) 
= (=) with g = >on? |p|, (7) 


ô 
B 


as, e.g., proposed by [39], where the hardening moduli h°! depends on the strain- 
like internal variable A with A = >; |7°]. 


8.5 Macroscopic Yield Surface Based on Polycrystalline RVEs 


In this section, we use two-scale simulations using crystal plasticity to compute 
macroscopic yield surfaces. These yield surfaces can then be used in FE2TI 
simulations without directly incorporating crystal plasticity. 

The influence of the microscopic polycrystalline material can be considered 
to compute the resulting macroscopic anisotropic yield surfaces, as mentioned 
in Sect. 8 and included in a hierarchical multiscale approach, see [32]. In the 
following, a microstructure consisting of a polycrystal with multiple grains is 
considered to model its macroscopic yield behavior. Here, for the computation of 
macroscopic yield surfaces based on the microscopic behavior of polycrystalline 
unit cells, the software Neper is used to generate a periodic unit cell with 15 grains. 
The geometry is meshed using 10-noded tetrahedral finite elements. In order to 
account for an isotropic orientation distribution of the polycrystalline unit cell, each 
grain is assigned to a specific orientation following from a geodesic dome. For 
details, we refer to [82]. With these unit cells, macroscopic yield curves based on 
macroscopic biaxial loading paths, i.e. 0) : 62,03 = 0, are computed in an FE? 
scheme. The stress-driven simulation requires small time steps, which is amplified 
by the small time step size required for the rate-dependent formulation of single 
crystal plasticity. Figure 26 shows the initial yield surface at (œ) = 3.3- 1078 as well 
as the distribution of œ inside the unit cell. Since the rate-dependent formulation 
does not have a distinct yield point and the rate-independent behavior is here 
modeled with p = 200, see Eq. (7), this value of equivalent plastic strains has been 
arbitrarily chosen by the authors. The evolved macroscopic yield surface based on 
a polycrystalline unit cell at (œ) = 4.7 - 1074 and a respective distribution of œ is 
shown in Fig. 26. As pointed out in [11], the initial yield surface forms the shape of 
a Tresca-type yield criterion, whereas the further evolved yield surface is of typical 
elliptical Mises-type. 
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Fig. 26 Polycrystal with 15 grains. Distribution of equivalent plastic strains œ for loading path 
Ti : 02 =—1: Oat af = (a) = 3.3- 1078 (left) and for loading patho, : 02 = 0: lat 
ay = (a) = 4.7- 1074 (center). Right: Associated successive yield surfaces; Tresca-type for aï 
and von Mises type for a} 


8.6 One-Way Coupled Simulation of Deep-Drawing Using 
Polycrystalline Unit Cells 


Finally, in this section, we demonstrate a two-scale simulation directly incorporating 
crystal plasticity on the micro scale. Such simulations are computationally expen- 
sive. Only a one-way coupling is used here, and J2 elasto-plasticity is applied on the 
macroscale. 

In the following, a sheet metal forming process of deep-drawing of a hat-profile, 
adopted from [7], using an Al-Cu alloy is simulated under consideration of the 
polycrystalline microstructure in a one-way coupled FE scheme. In Fig. 27, the 
finite element mesh (165 linear quadrilateral elements) is shown. The interaction 
between the sheet and the tools is realized with a frictionless penalty contact 
formulation. The macroscale simulation is carried out using a finite J2 elasto- 
plasticity model with isotropic von Mises yield behavior based on an algorithmic 
setting by [65]. The material parameters, cf. Eq. (5), were fitted to a macroscopic 
uniaxial tension test with the EE talline unit cell used on the microscale leading 
to KACO — 50,754 N/mm?, ^C! = 23,425 N/mm’, yf" = 125 N/mm?, 


yAlCu — 160 N/mm?, JAC! = 750 and WY" = 1 Niram, The final state of 
the aicet: forming simulation is depicted in Fig. 28 and the distribution of equivalent 
plastic strain is shown. Throughout the simulation, the deformation gradient F is 
captured at three different positions, marked by O, A and © therein, at the top, 
center and bottom of the sheet, respectively, leading to nine evaluation points in 
total. 

The recorded deformation is applied to a polycrystalline unit cell in a one-way FE 
coupling. The single crystal plasticity computation is performed at small strains, as 
described in Sect. 8.4. Thereby, the applied material parameters are taken from [92] 
with Lamé constant A = 35,104.88 N/mm?, shear modulus u = 23,427.25 N/mm’, 
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Fig. 27 Discretization of the macroscopic BVP of the deep-drawing of a hat profile under plane 
strain conditions. 1: drawing die, 2: blank holder, 3: sheet metal (discretized with 5 x 33 elements), 
4: punch, 5: punch radius of 7mm, and 6: die radius of 6mm are used. The contact definitions 
between punch, drawing die, blank holder, and sheet metal are realized using a frictionless penalty 
formulation. The analyzed RVEs in the macroscopic BVP are located near the punch radius, O, 
in the vertical section, ®©., and near the die radius, ©, according to the final deformation, cf. 28 
(left). The drawing depth of the hat profile is 45.7 mm with a sheet half width of 100 mm and a 
thickness of 1.4 mm 
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Fig. 28 Distribution of @ in sheet metal at final deformed state (t = 50) and strain-like internal 
variable A in polycrystalline unit cells at different points 
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initial slip resistance tọ = 60.84 N/mm, saturation stress Toy = 109.51 N/mm, 
initial hardening modulus ho = 541.48 N/mm?, material rate sensitivity parameter 
p = 200, and reference slip rate 7 = 1 - 107°. The small strain tensor @ is used to 
transfer the deformation state from the macroscale to the microscale, however, no 
coupling back from micro- to macroscale is considered. 

In Fig. 28, the distribution of equivalent plastic strain @ in the hat-profile and 
the distribution of the strain-like internal variable A as a result of the evaluation 
of the one-way coupled polycrystalline unit cells is shown. Differences between the 
positions of the polycrystals in the sheet are obvious as well as the nonhomogeneous 
distribution of A. 


9 Conclusion 


The vision of the EXASTEEL project is to develop a virtual HPC laboratory 
allowing for predictive virtual material testing of modern steels. On this path, we 
have moved forward in several directions: Since the properties of modern dual-phase 
steels largely stem from their microstructure, homogenization is indispensable to 
achieve our goals. We therefore have developed and implemented the FE2TI library, 
a highly scalable software for computational homogenization based on the FE? 
approach (Sects. 2—4). This approach was then used, for the first time, to compute a 
forming limit diagram for DP600 steel using the JUWELS supercomputer (Sect. 5). 
Let us remark that the computation of an FLD is already a step beyond the 
achievements envisaged in the original EXASTEEL-2 proposal. We have also 
shown scalability of the FE2TI package up to the largest supercomputers currently 
available, e.g., using more than one million MPI ranks for nonlinear production 
problems, i.e., using unstructured meshes, elasto-plasticity, and full parallel I/O [64] 
(Sect. 3). These latter simulations use parallel FETI-DP solvers for the RVE 
problems and made use of the full JUQUEEN supercomputer. 

To move towards full use of the future exascale supercomputers, we have worked 
on extending the parallel scalability of implicit nonlinear FETI-DP and BDDC 
domain decomposition solvers (Sect. 6). Scalability to 800,000 parallel tasks was 
achieved for our nonlinear solvers [54], outside of our parallel FE? multiscale 
context; see Fig. 15. These simulations used the full Mira supercomputer. We have 
also considered techniques to improve the energy efficiency of our nonlinear domain 
decomposition solvers (Sect. 6.1.1). Careful performance analysis and engineering 
was applied to the FE2TI software building blocks, e.g., for the performance 
engineering of the sparse triangular solves of the PARDISO sparse direct solver 
(Sect. 7). 

For the modeling, considering initial volumetric strains, resulting from the 
complex steel production process, has shown to be of interest; therefore, an 
efficient algorithmic approach to IVS was proposed (Sects. 8.1 and 8.2). This IVS 
approach has been implemented in FE2TI. Further improvements in our modeling 
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may be achieved by incorporating effects from crystal plasticity (Sect. 8.4). An 
approach to fit macroscopic yield surfaces to crystal plasticity simulations was 
presented (Sect. 8.5). The resulting yield surfaces can be used in FE2TI without 
using an explicit coupling with crystal plasticity simulations. However, we have 
also demonstrated a two-scale simulation using a one-way coupling with crystal 
plasticity (Sect. 8.6). 

For the quantitatively predictive simulations envisaged in this project, several 
improvements are planned for the future. First, realistic material models reproducing 
the physics on the microscale are important. Different advanced approaches beyond 
the ones considered so far may be of interest, e.g., based on the techniques described 
in Sect. 8. Second, for the computation of the FLD, the exploitation of the symmetry 
of the Nakajima specimen has to be reviewed and, especially for strongly anisotropic 
microstructures, simulations using the full geometry have to be performed for all 
specimen. Third, a validation with experiments for steels other than DP600 will be 
necessary. Finally, once exascale supercomputers will be available, predictive virtual 
steel simulations at the exascale will leverage the combined parallelism of the FE? 
algorithm and of the parallel nonlinear domain decomposition solvers. 
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behind it and several case studies that demonstrate its feasibility and versatility: 
high-performance stencil codes can be engineered, ported and optimized more 
easily and effectively. 


1 Overview of ExaStencils 


1.1 Project Vision 


ExaStencils! takes a revolutionary, rather than evolutionary, approach to software 
engineering for high performance. It seeks to provide a proof of concept that 
programming can be simplified considerably and that program optimization can be 
made much more effective by concentrating on a limited application domain. In the 
case of ExaStencils, it is a subdomain of geometric multigrid algorithms [84]. The 
idea is to write a program in a domain-specific programming language (DSL). In our 
case, it is an external DSL (i.e., a DSL built from scratch rather than embedded in 
an existing language), called ExaSlang [72]. ExaSlang is stratified into four layers 
of abstraction. Each layer provides a different view of the problem solution and 
can be enriched with information to allow for particular optimizations at that layer. 
ExaSlang’s most abstract layer specifies the problem as a set of partial differential 
equations (PDEs) defined on a continuous domain. The most concrete layer allows 
the user to specify low-level details for an efficient implementation on the execution 
platform at hand. Ideally, the domain expert should only be dealing with the first 
layer (plus some menu-driven options). The ExaStencils code generator should 
be able to generate all lower code layers while applying a set of optimizations 
autonomously before producing efficient target code. 


1.2 Project Results 


This subsection summarizes the challenges that drove the development of Exa- 
Stencils and how far we got in the period of SPPEXA funding. 

We begin with the delineation of the domain and the mathematical challenges, 
distinctly from the computer science challenges, that ExaStencils addressed (see 
Sect. 2). We restricted our attention to the development of smoothers for geometric 
multigrid methods and the required analysis tools. In this domain, local Fourier 
analysis (LFA) is the method of choice to analyze the developed smoothers and 
the entire multigrid method. To make LFA useful for our purpose, we extended the 
method to periodic stencils, covering block smoothers and varying coefficients [10]. 
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The first major computer science challenge was to cover, with one single source 
program, a wider range of multigrid solvers than is possible with contemporary 
implementations based on general-purpose host languages such as Fortran or C++ 
(see Sect. 3). To this end, we decided not to build ExaSlang on an existing, general- 
purpose host language but to make it an external DSL. We were able to demonstrate 
its flexibility already early on in the project by providing a common ExaSlang 
program for Jacobi, Gauss-Seidel, and red-black solvers for finite differences and 
finite volumes with constant and linear interpolation and with restriction [53]. 
Later on, we introduced a way to specify data layout transformations simply by a 
linear expression—a feature that can aid the development process significantly [46] 
(see Sects. 3.1-3.2). A smaller task was to decide how to describe aspects of the 
execution platform in a platform-description language (TPDL) [75] (see Sect. 3.3). 

The second major computer science challenge was to reach high performance 
with our approach to code generation on a wide range of architectures (see Sect. 4). 
One important aspect here is what information to provide at which layer (see 
Sect. 4.1). While the syntax of ExaSlang is partly inspired by Scala, Matlab and 
LaTeX, our target language is C++ with additional features that depend on the 
execution platform (see Sect. 4.2). We demonstrated that, with the help of our 
optimization techniques (see Sects. 4.3-4.5), weak scaling could be achieved on 
traditional cluster architectures such as the JUQUEEN supercomputer at the Jiilich 
Supercomputing Center (JSC) [48, 72]. We also achieved weak scaling on the 
GPU cluster Piz Daint at the Swiss National Supercomputing Centre (CSCS). 
Furthermore, we demonstrated the automatic generation of solvers that can be 
executed on emerging architectures such as ARM [51] and FPGA platforms [73, 77]. 

One new concept that ExaStencils introduced into high-performance computing 
is that of feature-based domain-specific optimization [5] (see Sect. 4.6). The central 
idea is to view a source code, such as a stencil implementation or an application, 
as a member of a program family or software product line rather than as an isolated 
individual, and to describe the source code by its commonalities and variabilities 
with respect to the other family members in terms of features. A feature represents 
a concept of the domain (e.g., a type of smoother or grid) that may be selected 
and combined with others on demand. With this approach, a large search space of 
configuration choices can be reviewed automatically at the level of domain concepts 
and the most performant choices for the application and execution platform at hand 
can be identified. To this end, we devised a framework of sampling and machine 
learning approaches [36, 39, 80] that allow us to derive a performance model of 
a given code that is parameterized in terms of its features. This way, we can 
express performance behavior in terms of concepts of the domain and automatically 
determine optimal configurations that are tailored to the problem at hand, which we 
have demonstrated in the domain of stencil codes [29, 30, 32, 36, 39, 59, 80] and 
beyond (e.g., databases, video encoders, compilers, and compression tools). Our 
framework integrates well with the other parts of ExaStencils that use and gather 
domain and configuration knowledge in different phases. 

Project ExaStencils came with several case studies whose breadth was to 
demonstrate the practicality and flexibility of the approach (see Sect. 5). The case 
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studies are a central deliverable of ExaStencils. They include studies close to real- 
world problems: the simulation of non-Newtonian and non-isothermal fluids (see 
Sect. 5.3) and a molecular dynamics simulation (see Sect. 5.4). 

We conducted two additional studies at the fringes of ExaStencils, exploring 
alternative approaches (see Sect. 6). In one, the option of an internal rather 
than external DSL was explored: ExaSlang 4 was embedded in the mutual host 
languages Java and Ruby to study the trade-off between the effort of the language 
implementation and the performance gain of the target code [17]. The outcome 
was that an embedding is possible but, as expected, with a loss of performance 
(see Sect. 6.1). In the second study, we implemented a simple multigrid solver in 
SPIRAL [11] (see Sect. 6.2). The success of the SPIRAL project [24, 61] a decade 
ago was a strong motivator for project ExaStencils. SPIRAL can handle simple 
algebraic multigrid solvers but would have to be extended for more complex ones. 


1.3 Project Peripherals 


Attempts of abstraction and automation in programming have received increased 
attention in the past two decades in the area of software engineering. High- 
performance computing has been comparatively conservative in going down this 
road. The reason is that the demands on performance are much higher than in 
general software engineering, and the architectures used to achieve it are more 
complex, notably with large numbers of loosely coupled processors. 

The potential of an effective automation grows as the application domain 
shrinks. Promising domains are much smaller than those of any general-purpose 
programming language. The extreme is the compiler FFTW [25] that targets a 
single numerical problem, the fast Fourier transform. As just mentioned, SPIRAL 
widened the domain to linear transforms (and, lately, beyond [81]). By now, quite a 
number of optimizing code generators have been proposed that target stencil codes. 
Patus [18] has a strong focus on autotuning. The strong point of Pochoir [82] is 
the cache obliviousness of its target code. Pochoir addresses constant stencils and 
is based on the C dialect Intel Cilk with limited portability to other platforms. 
Devito [52] performs symbolic computation via SymPy and targets shared-memory 
architectures, just like Snowflake [90]. Firedrake [63] is primarily for finite-element 
methods and has adopted multigrid lately. One recent component of it is Coffee [54], 
which addresses the local assembly in finite-element methods. STELLA [34] and its 
spiritual successor GridTools generate code for stencil computations on distributed 
structured grids for climate and weather modeling. 

For the domain of image processing, many approaches based on code generation 
exist that are conceptually similar to the idea behind project ExaStencils. Notable 
projects include Halide [62], HIPA® [57], PolyMage [58], and Mint [85]. The 
specification of image filter kernels is related to the concept of stencils, and 
many of the basic parallelization techniques are comparable. However, ExaSlang 
is fundamentally based on computational domains that are multidimensional, as 


ExaStencils: Advanced Multigrid Solver Generation 409 


opposed to the usually two-dimensional data structures used to represent images. 
Image processing DSLs usually target shared-memory parallel systems, i.e., single 
compute nodes such as multi-core CPUs or a single GPU. On the other hand, 
ExaStencils aims at the domain of HPC and, consequently, supports distributed- 
memory systems and respective parallelization techniques. 


2 The Domain of Multigrid Stencil Codes 


2.1 Multigrid 


The goal of the ExaStencils project has been the automatic generation of efficient 
stencil codes for geometric multigrid solvers [84] on (semi-)structured grids. 
Multigrid methods form a class of iterative solvers for large and sparse linear 
systems. These methods have originally been developed in the context of the 
numerical solution of partial differential equations, which often involves the solution 
of such linear systems. 

Multigrid methods combine two complementary processes: a smoothing process 
and a coarse-grid correction. While each of the two processes is by itself not 
sufficient to solve the problem efficiently, their combination yields a fast solver. 

The smoothing process is usually a straightforward iterative method that con- 
verges rather slowly when used on its own. Combining such a process with a 
coarse-grid correction accelerates the convergence of the resulting method by 
augmenting the iteration with information obtained on a coarser grid. Since the 
grid determines the resolution at which the solution is being computed, a multigrid 
method considers the problem at different resolutions. 

A multigrid method performs stencil computations on a hierarchy of fine to 
successively coarser grids. The overall cost of the method can be reduced further by 
applying this idea recursively, i.e., instead of two grids, we consider a hierarchy of 
successively coarser grids. The recursion follows a so-called cycling strategy, e.g., 
a V-cycle or a W-cycle (see Fig. 1). The cycling strategy determines how much 
work is performed at what level of the grid hierarchy, which also has an effect on 
the convergence rate of the method. On the coarser grids, less processing power is 
required. 

In summary, to construct a multigrid method, one must choose a set of compo- 
nents: a smoother, an interpolation, a restriction, a coarse-grid approximation, and 
a cycling strategy. The choice of components influences the number of iterations 
required to obtain an adequate solution, the computational cost per iteration, and 
the communication pattern of the method. Furthermore, the behavior of the method 
depends on the linear system to be solved. 
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Fig. 1 Cycling strategies across the grid hierarchy. The hierarchy levels are denoted by Q7, Q21, 
and Q4, progressing from the finest to the coarsest grid. Light circles solve the coarse-grid 
system directly, and dark circles solve the coarse-grid system recursively. Down arrows symbolize 
restrictions, up arrows interpolations. (a) V-cycle. (b) W-cycle 


2.2 Advancing the Mathematical Analysis of Multigrid 
Methods 


At the start of project ExaStencils stood the search for a way to estimate the number 
of iterations that a multigrid method requires to produce an adequate solution. The 
choice of multigrid components determines the operations that must be performed 
per iteration. Hence, to estimate the time required per iteration, one just needs to 
place these operations in an appropriate order and estimate the duration time of 
their execution. While this is by no means a trivial endeavor, estimating the number 
of iterations needed is inherently different. In this section, we deal with the latter 
problem. 

To estimate the number of iterations, we must consider the mathematical 
properties of a given multigrid method for a given set of components and problem. 
For this purpose, we must determine the contraction properties of the iteration 
operator of the multigrid method for a given configuration. To this end, local 
Fourier analysis (LFA) tells us whether the repeated application of the iteration 
operator lets the error converge to zero. In particular, we are interested in the 
resulting rate of convergence. 

For project ExaStencils, we had to extend the capabilities of LFA. We wanted 
to analyze block smoothers, which have a higher arithmetic density than standard 
smoothers (see Sect. 2.3), and complex problems for which no feasible way of 
applying LFA was known. Furthermore, we aimed at an automation of the analysis. 

At the beginning of project ExaStencils, we considered using and expanding 
an existing LFA software [88]. However, in the course of the project, it became 
quickly evident that we needed a more general approach, something beyond a 
collection of templates that allow to fill in some blank spots but do not facilitate the 
reuse or recombination of components. Since we intended to explore many possible 
combinations of various multigrid components, it was not feasible to program an 
individual LFA by hand for every combination. We developed the LFA Lab [65], 
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which is based on the principle of combining a set of primitive expressions to 
complicated ones, enabling a much more flexible analysis. 

A multigrid method, as well as its components, can be characterized by their 
iteration operators. An iteration operator describes the propagation of the approx- 
imation error during the execution of the method. If we denote the error after the 
k-th iteration with e and the iteration matrix of the method with E, we have the 
following equation: 


oUt) = Eel) 


The spectral radius o(E) and the operator norm ||E|| of E characterize the 
asymptotic and worst-case error reduction factors of the method, respectively. LFA 
determines these quantities in a simplified setting. 

Let h, with O < h e R, be the step size of the infinite d-dimensional grid 
Gn := h- Zf. We consider stencil operators on the space of bounded grid 
functions €2(G;,) := {u : Gr > C | panes |u|> < oo}. A stencil operator 
A: £9(Gp) —> €2(Gp) is a linear operator given by a family {sk}keza, Sk € R, such 
that 


Au (x) := Ý sku + hk) for x € Gpandu € €2(Ga). 
kezd 


Stencil operators have a particularly simple form when transformed via the discrete- 
time Fourier transform. 

The discrete-time Fourier transform (DTFT) Fis a linear isometry that maps the 
space £2(G;,) onto L2(@p) := {u: O, > C | to, |u(x)|* dx < co}, where ©; := 
[0, 2=)¢. In other words, it represents a function on an infinite grid by a function on 
a continuous and bounded interval. A stencil operator A in Fourier space, i.e., the 
operator FAF |, is just the multiplication by a function å : L2(O,) —> C. We call 
the function â the symbol of A. Thus, the symbol of a stencil operator is a function 
that encodes the entire behavior of the infinite-dimensional operator. 

The symbol â of a stencil operator reveals the desired information about the 
operator A. We have that 


p(A) = ess-supgee, |4(9)| and ||A|| = ess-supgee, lâ (0)| , 


where ess-sup denotes the essential supremum. These formulas mean that, to 
compute the spectral radius or the operator norm of a stencil operator, we must just 
compute the largest value of the absolute modulus of its symbol â. Note that, in the 
definition of stencil operators, we have assumed that stencil s does not depend on 
position x. However, it can be useful to consider operators whose stencil is allowed 
to change with the position. 

A stencil that depends arbitrarily on the position has no particularly simple form 
in the Fourier domain. However, we were able to show that periodic stencils do have 
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a simple representation [10, 66]. A periodic stencil depends on the position, but has 
the same entries repeated periodically across the entire domain. While this does not 
represent stencils accurately that are variable in the entire domain, at least some 
variability is reflected in the analysis. We showed that a periodic stencil operator is 
described, after a proper reordering of the frequency domain, by a matrix of ordinary 
symbols—more precisely, by matrix symbols from the space Ly” (Oy) for some 
appropriate positive h’. Furthermore, we were able to show that there is a one-to- 
one relationship between matrix symbols and periodic stencils. 

Using matrix symbols, similar results for the spectral radius and operator norm 
hold. For an operator given by a periodic stencil, we have that 


p(A) = ess-supyco,, ]G(8)|| and Al] = ess-supyco,, P(@()) . 


Thus, to obtain the norm and spectral radius of the operator, we must find the largest 
value of the norm and spectral radius of the matrix a(@). 

The framework of periodic stencils and matrix symbols allows for more advanced 
problems to be analyzed. It also lends itself to automation via software. Operators 
that have a matrix symbol can be combined in many ways such that the combination 
also has a matrix symbol. Thus, we can create a flexible LFA software by using the 
idea of providing first a set of primitive expressions and then means of combination 
and abstraction [1]. 

For example, the iteration operator of the weighted Jacobi method is 


E;=I- DA, 


where A is the system matrix, D the diagonal part of A, I the identity matrix and 
w € R a weighting factor. If we assume that the behavior of A can be modelled 
sufficiently accurately by a (periodic) stencil operator on an infinite grid, we can 
replace each matrix by the infinite-dimensional operator given by the corresponding 
(periodic) stencils to simplify the analysis. Listing 1 shows the computation of the 
spectral radius of the iteration operator of the Jacobi method for the stencil resulting 
from the discretization of the Poisson equation using our software LFA Lab [65]. 

LFA Lab can be used as a simple Python [86] library, but it is more or less an 
embedded DSL for LFA. The user provides the (periodic) stencils, which yield the 
corresponding operators. Then, the user combines these operators with interpolation 
and restriction to an expression describing the desired iteration operator. 


ifrom lfa_lab import * 

ga Ganka, 10/2; 10/321) 
3A = gallery.poisson_2d(g) 

AL operator .identity (g) 


somega = 0.8 
6] I - omega x A.diag().inverse() x A 
print ((E.symbol () .spectral_radius () ) ) 


Listing 1 Implementation of an LFA of the weighted (w = 0.8) Jacobi smoother for the solution 
of the Poisson equation using LFA Lab 
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The code in Listing | is essentially a direct implementation of the formula for 
the iteration operator. However, keep in mind that this formula describes actually a 
combination of infinite-dimensional operators. When calling the symbol method, 
the software determines automatically a way to represent the given iteration operator 
via its Fourier matrix symbol, which is a non-trivial procedure. 

To compute the matrix symbol of an expression given by the user, appropriate 
sampling parameters must be determined. For this purpose, LFA Lab has two stages. 
The first stage extracts and analyzes the expression tree of the formula that the user 
entered. The second stage then samples the matrix symbol to obtain the spectral 
radius and operator norm. The power of the software lies in the fact that arbitrarily 
complex expressions can be analyzed. 

A more complex example is the analysis of the two-grid method. It has the 
iteration operator 


Ero = S (I — PAZ! RA)S, 


where Ae is the coarse-grid operator, P the interpolation, R the restriction, and S 
the iteration operator of the smoother. Assume that we already have an analysis for 
the smoother. If we can express P, R, and A, using Fourier matrix symbols, we 
can combine these with the analysis of the smoother we already have to analyze the 
two-grid method in its entirety. 

In summary, we have constructed a powerful and flexible LFA software. The 
flexibility comes from a small set of primitive expressions and means of combina- 
tion and abstraction. The periodic stencil operators are the primitive expressions, 
mathematical operations are a means of combination, and the Python programming 
language provides the opportunity of abstraction. The software is used to estimate 
the convergence rate of a multigrid method for a given set of components and a 
given problem. The estimate comes as a number of iterations a multigrid method 
needs to achieve adequate accuracy. 


2.3 Advancing Multigrid Components 


The choice of the smoothing component in a multigrid method is not always 
straightforward. Some problems require advanced smoothers. This can be easily 
appreciated when considering that, since the system contains a zero block, a 
pointwise relaxation is not possible [84]. The steady-state Stokes equations can be 
written as follows: 


—Au+Vp=f, in Q 
V-u=0, in Q 
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(a) (b) 


Fig. 2 Unknowns that are included in the blocks of the smoothing steps. (a) Vanka smoother. (b) 
Triad smoother 


for a given domain Q with boundary dQ. Here, u is the vector-valued fluid velocity, 
p is the pressure, and f describes an external force. 

This linear system of PDEs can be discretized on staggered grids or by using 
appropriate finite elements. Here, we consider staggered grids in two dimensions. 
For the Stokes equations, the efficient Vanka smoother has a relatively high 
computational cost and unsatisfactory parallelization properties due to a process 
of overlapping block-smoothing steps (see Fig. 2). This made us consider the Triad 
smoother as an alternative: it provides low computational cost in combination with 
good parallelization properties [19]. 

Both block smoothers are based on a collective updating process of unknowns 
inside one block [87]. As depicted in Fig. 2, the Vanka blocks consist of five 
unknowns including one pressure and two velocity components in each direction 
while the Triad smoother comprises the simultaneous update of three unknowns 
including one unknown of each kind. 

Numerical results for periodic boundary conditions in combination with par- 
allelization properties and computational work show the potential of the Triad 
relaxation method. However, numerical results for the Stokes system with Dirichlet 
boundary conditions show that the Triad method in its original form has one issue: 
the convergence rate deteriorates tremendously. To illustrate this, we applied the 
method to the Stokes equation discretized on a staggered grid in the unit square, 
with periodic and with Dirichlet boundary conditions. As right-hand side, we 
chose in both cases fy, (x, y) = 27? sin(nx) sin(xy) + mcos(zx), fu, (x, y) = 
Qn? cos(x x) cos(ry) — x sin(xy), and fp = 0; the initial guess was zero. Figure 3 
shows the different convergence behaviors. 

The convergence properties of the Triad smoother can be improved with the fol- 
lowing idea: repeat the relaxation process four times while changing the unknowns 
contained in one block after each iteration, i.e., rotating the “L’-shaped pattern that 
describes the block to be relaxed. This algorithm, illustrated in Fig. 4, improves 
the convergence significantly. When applying the four iterations of the proposed 
smoother, it is very similar to one iteration of Vanka. Thus, a smoother that has not 
been considered as an option before becomes a viable alternative to established 
smoothers for systems of PDEs. In addition, the order in which the boxes are 
updated can be varied. For more details and further results, see the dissertation of 
Lisa Claus [19]. 
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Fig. 3 Convergence behaviors of the two block smoothers applied to the Stokes equations dis- 
cretized using staggered grids in the unit square with periodic and Dirichlet boundary conditions. 
(a) Dirichlet boundary conditions. (b) Periodic boundary conditions 


Fig. 4 Order of iterations of the advanced block smoother. (a) Step 1. (b) Step 2. (c) Step 3. (d) 
Step 4 


3 Stencil-Specific Programming in ExaStencils 


The central element in the ExaStencils approach is its stencil-specific programming 
language ExaSlang and its code generator. Then there is also a language for 
specifying properties of the execution platform. The two languages are discussed 
in this section. The code generator is the subject of the following section. 


3.1 The Domain-Specific Language ExaSlang 


The idea of ExaStencils is to support the domain-specific programming and opti- 
mization of stencil codes by providing different layers of abstraction, specifying the 
various aspects of the stencil code at the respectively suitable layer, and exploiting 
domain information available at that layer for an optimization of the specification. 
ExaSlang comes in four layers: from ExaSlang 1, the most abstract, to ExaSlang 4, 
the most concrete (see Fig. 5). 


e Layer 1: the continuous problem 
This is the layer for the scientist or engineer who needs the solution of the PDE. 
The problem is specified as a continuous equation. The present implementation 
supports Unicode and LaTeX symbols. Optional inputs are the specification of 
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Fig. 5 ExaSlang layers of abstraction 


discretization and solver options used to autogenerate lower layers. There is also 
support for an automatic finite-difference discretization of operators. 

¢ Layer 2: the discrete problem 
This is the most abstract layer that provides an executable description. Discretized 
functions are fields (data type, grid location), tied to a computational domain. 
Geometric information is provided in the form of virtual fields resolved to 
constants or field accesses. Discretized operators are provided as stencils or stencil 
templates. 

e Layer 3: the solver 
At this layer, multigrid appears in the form of the specification of a solver 
for the discrete problem, either provided by hand or set up automatically. The 
implementation supports a Matlab-like syntax. 

¢ Layer 4: the application 
This layer of ExaSlang can describe a full application, including communication, 
input/output, evaluation and visualization, but is still more abstract than C++ or 
Java in that it contains language constructs specific to multigrid. Optimizations 
at this layer include the tuning of communication patterns and data structure 
layouts. 
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Yet more concrete, but not accessible to the user, is an intermediate representation 
(IR) in which most code refinements take place and which forms the base for the 
generating of the target C++ code. 

One may write one’s program at the ExaSlang layer of one’s choice and let 
the code generator refine it to a more concrete layer. One may also modify 
generated code to implement certain aspects that cannot be expressed at the chosen 
programming layer. However, the IR representation is not meant to be modified by 
the user. 

Through its four layers, ExaSlang evolves from a declarative language at layer | 
to an imperative programming language at layer 4. Besides the standard data types 
that represent floating-point and integer numbers or strings, domain-specific data 
types represent vectors and matrices to be used for coupled systems of equations. 
Assembly of global vectors and matrices is not supported since we focus on local 
computations using stencils. In contrast, stencils are declared globally in ExaSlang 
programs. Furthermore, fields—corresponding to vectors that may represent the 
PDE’s right-hand side or an approximation of the unknown function—are declared 
globally as well. Via declaration, certain settings important to parallelization may 
be specified by the user. One example is the size of ghost layers (also called halo 
layers), as depicted in Listing 5 in Sect. 3.2. 


3.2 An ExaSlang Example 


To illustrate how our language stack can be used to implement different aspects of 
partial differential equations (PDEs) solvers, let us take the Poisson equation 


—-Au=f in Q, 
u=g on dQ 


for a given domain &2—here the unit square—Dirichlet boundary conditions g = 
cos(mx) — sin(2zy) and the right-hand side f = m?cos(ax) — 4n?sin(2zry). 
Listing 2 shows an exemplary layer 1 code for this specification. From it, our 
generator is able to derive representations at subsequent layers. The refinement 
methods employed in this process are described in Sect. 4.1 further below. Listings 3 
and 4 illustrate variants similar to the autogenerated ones expressing the discretized 
version of the given equation (Listing 3) and the multigrid algorithm used to solve 
for it (Listing 4). Based on them, a complete layer 4 program can be assembled, 
comparable to the one illustrated in Listing 2. In the example code at the lower 
layers, parts of the source code have been omitted for the sake of compactness. A 
complete specification and examples of other PDEs are part of Sebastian Kuckuk’s 
dissertation [47]. 
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sop = - A 


9 


wuEq: Op * u == £ 
11 
2/x discretization and solver hints (see Subsection 4.1) */ 


Listing 2 ExaSlang 1 code for the complete specification of the 2D Poisson problem 


leioa sete l @, (@ I ea [| al, a | 


3Solution with Real on Node of global 
sSolution@finest on boundary = 

5 Cos ( Pi se x J = Ela ( 2.0 2 Pi & oy ) 
6Solution@(all but finest) on boundary 


sRHS with Real on Node of global = 
9 PI**2 * cos ( PI * vf_nodePos x ) - 
4.0 x PIx*2 x Sin ( 2.0 x PI * vf _nodePos y ) 


Laplace from Stencil 
l ©; Ol) ss 2.0 vfi_gridWidth_x««2 
2. vf_gridWidth_y*«2 
Sil, O =i; vf_gridWidth x+**2 
i, Ol ss =l, vf£_gridWidth_x««2 
Op call] Se Sab. vfi_gridWidth_ y«*+*2 
@, di] Ss =i. vf_gridWidth_y*«2 


a1SolEq { 
Laplace » Solution 


Listing 3 ExaSlang 2 code for a complete specification of the 2D Poisson problem 


iField Residual from Solution 
20verride bc for Residual with 0.0 


¡Operator Restriction from default restriction 

5 on Node with 'linear' 

sOperator Prolongation from default prolongation 
on Node with 'linear' 


o9Function Smoother@all { 
1 repeat 3 times { 
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11 Solution += diag iny ( Laplace ) * ( RHS - 
12 Laplace * Solution ) where (iO + il) % 2 == 0 
13 Solution += diag_inv ( Laplace ) «* ( RHS - 
14 Laplace » Solution ) where (iO + il) % 2 == 1 


Function VCycle@coarsest { 
/* implementation of a coarse-grid solver */ 


2Function VCycle@(coarsest + 1 to finest) { 
Smoother ( ) 


Residual = RHS - Laplace » Solution 
RHS@coarser = Restriction * Residual 


Solution@coarser = 0.0 
vCycle@coarser ( ) 


Solution += Prolongation@coarser » Solution@coarser 


Smoother ( ) 


Listing 4 ExaSlang 3 implementation of a V(3, 3)-cycle using an RBGS smoother 


iLayout DefLayout<Real, Node >@all { 
duplicateLayers = [1, 1] with communication 
3 ghostLayers = [i, 2] with communication 
i} 
6Field Solution< global, DefLayout, /* bc's */ >@finest 
7Field Solution< global, DefLayout, 0.0 >@(all but finest) 
sField RHS < global, DefLayout, None > 
o9Field Residual< global, DefLayout, 0.0 > 
10 
u/* operators as on layers 2 and 3 x/ 
12 
Function Smoother@all { 
ACOLO WIEN (LO FRIII 
loop over Solution { 
Solution += omega * diag_inv ( Laplace ) x 
( RHS - Laplace » Solution ) 


2 { 


} 


communicate Solution 


23/* VCycle functions +*/ 
24 

Function Application { 
» f/x initialization */ 
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repeat 10 times { 
vCycle@finest ( ) 


/* de-initialization */ 


Listing 5 ExaSlang 4 code of a full application with a fixed number of V-cycles to solve for 
Poisson’s equation discretized with finite differences 


3.3. The Target-Platform Description Language 


To be able to optimize a code adequately, one must know details of the execution 
platform. Our code generation process is governed by more than one hundred 
parameters that allow to select specific code refinements or to set device-specific 
properties. Examples include the use of vector units on CPUs and the corresponding 
instruction set to use, e.g., SSE or AVX on x 86 CPUs, NEON on ARM-based CPUs 
or even QPX for IBM’s POWER architecture. This yields a design space that is too 
large for users to be able to specify a (near-)optimal configuration of code generation 
settings. However, we may be able to derive sensible parameters from a structured 
description of the target platform. To this end, one element of ExaSlang is the so- 
called target platform description language (TPDL) [75]. One design goal was to 
increase the modularity and reusability of hardware component descriptions, such 
as CPUs or accelerators, to let users compose systems based on a repository of 
ready-made parametric snippets. By treating these in a fashion similar to the class 
concept in object-oriented programming languages, users can infer instantiations 
with parameters set appropriately. A short example describing an accelerator card is 
provided in Listing 6. It enumerates a number of technical hardware details, but also 
contains the important software information on the compute capability, i.e., which 
features of the target technology CUDA may be used. 


i<gpu name="Tesla_V100" role="worker"> 
<param name="compute capability" value="7.0" /> 
<param name="api" value="cuda" /> 
<memory size="16" unit="GigaByte" Type="HBM2"> 
<param name="bandwidth" value="900" unit="GigaBps"/> 


</memory> 
<core quantity="5120" frequency="1246" 
s frequency unit="MegaHz" /> 
»</gpu> 


Listing 6 ExaSlang description of an accelerator card 
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ival workers = predefinedQuery (tpdlTree, GetWorkingUnits ) 
2val workers50 = workers.filter(w => w.power > Watt (50) ) 


sworkers50.foreach (System.out.println(_)) 


Listing 7 Query to enumerate working units consuming more than 50 W of power 


A small, yet flexible library supports information retrieval from a TPDL specifi- 
cation. This enables DSL developers to check for certain information and aggregate 
or evaluate characteristics of the target platform. Developers need not worry about 
the instantiations and their parameters but may just use the discrete specification, 
since all the processing required has already been done. For many recurring tasks, 
predefined queries are available. Listing 7 shows a predefined query to return all 
working units in a system. Its result is filtered to retain only working units that 
consume more than 50 W of electrical power. 


4 The ExaSlang Code Generator 


4.1 Refinement of ExaSlang Programs 


The overall goal of ExaStencils has been to enable users to choose the layer 
most appropriate for them and code exclusively at this layer, e.g., by providing a 
continuous formulation of the problem to be solved at layer | and nothing more. 
Ideally, our framework would then automatically derive suitable discretizations 
(layer 2), solver components (layer 3) and parallelism particulars (layer 4). However, 
in practice, this requires domain knowledge whose automatic inference is beyond the 
capabilities of present software technology. We address this issue by introducing 
hint specifications which allow us to progress automatically to subsequent layers of 
ExaSlang. For example, at layer 1, discretization hints may be supplied. As Listing 8 
shows, continuous functions are discretized at certain points of a computational grid, 
such as node positions. We also support the specification of operator discretization 
using finite differences. An optional renaming is also possible at this stage. 


iDiscretizationHints { 
u => Solution on Node 
op => Laplace with "FiniteDifferences" on Q order 2 


| uEq 


s} 


Listing 8 A discretization hint block for a scalar equation in ExaSlang 1 


Then, we combine the equation provided at layer 1 and the discretization hints to 
synthesize a discretized form of the equation at layer 2. While layer 2 expresses the 
problem to be solved, its solution is specified at layer 3. This can be achieved either 
by implementing a suitable iterative solver by hand or by issuing a directive for our 
generate solver interface. Listing 9 illustrates such a directive. 
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igenerate solver for Solution in Eq with { 
solver_smoother_numPre 
solver_smoother_numPost 


solver smoother coloring = "red-black" 
solver _cgs = "ConjugateGradient" 


Listing 9 A generate solver statement in ExaSlang 3 


For cases in which no further modification is required, matching solver hints may be 
provided at the upper layers to set up the code in Listing 9 automatically, allowing 
users to work exclusively at one layer. The generated solver is by default a geometric 
multigrid variant. In the concrete case of Listing 9, our framework would generate 
a standard V-cycle using three pre- and post-smoothing steps of a red-black Gauss- 
Seidel (RBGS) and a conjugate gradient (CG) coarse-grid solver. Frequently, minor 
adaptations of the generated solver are necessary. They can either be implemented 
by taking the implementation generated at layer 3, adapting it and replacing the 
original generate solver directive with the result. A more generic approach is to add 
modifiers to the generate solver directive. They target usually a certain stage of the 
multigrid solver, e.g., the restriction or the correction, at one or more levels of the 
hierarchy. These stages can either be replaced completely with custom layer 3 code, 
or arbitrary layer 3 statements may be added to be executed before or after the stage. 
A more complex option exists for smoother stages, as illustrated in Listing 10 for 
the case of a block smoother to be used when solving for the Stokes equations on a 
staggered grid (see Sect. 5). 


ismootherStage { 
loopBase p solveFor { 
u@[0, 0] u@[1, 0] 
v@[0, 0] v@[o, 1] 


Listing 10 ExaSlang 3 code for a smoother stage for the 2D Stokes problem 


After the algorithmic specification at layer 3 is completed, it must be transformed 
to a program at layer 4. This requires, most importantly, the addition of data 
layout information for fields, loops to compose kernels, and suitable communication 
statements. Listings 11 and 12 illustrate the first half of an RBGS smoother. 


Solution += ( diag_inv ( Laplace ) 
* ( RHS - Laplace * Solution ) 


) where (i0 + il) 


Listing 11 ExaSlang 3 code for the first half of an RBGS smoother 
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icommunicate Solution 
2loop over Solution where ( iO + il ) % 2 
( 


Solution += diag _inv ( Laplace ) x 
Solution ) 


RHS - Laplace x 


Listing 12 ExaSlang 4 code for the first half of an RBGS smoother derived from its ExaSlang 3 
counterpart in Listing 11 


This description of features is by no means complete. More detail is available in 
Sebastian Kuckuk’s dissertation [47]. 

We designed the language stack of ExaSlang to enable maximum flexibility for 
users. They can either work at one level exclusively and make use of the hint system 
to generate more concrete specifications automatically, or implement different parts 
of their application at multiple levels, or mix both approaches. 


4.2 Generation of Target Code 


After the domain-specific program has been refined to a complete program spec- 
ification (ExaSlang 4), the IR code is subjected to many non-algorithmic trans- 
formations (recall Fig. 5 on page 416). Among others, parallelization techniques 
are applied to the code (see Sect. 4.4), memory layouts are selected and applied 
to variables (see Sect. 4.3), required code fragments are inserted, and finally the 
program is written to disk as C++ files. 

To express all these steps in a short syntax, we developed the code-transformation 
framework Athariac [76]. It allows to modify the tree-based representation of a 
program—called the abstract syntax tree (AST)—by using simple, yet powerful 
rewrite rules. Essentially, we specify a pattern in the AST, such as the node that 
represents a certain ExaSlang statement, and then define a structure with which 
to replace it. Of course, we can also remove nodes by specifying an empty 
replacement. A single rewrite rule is called a transformation, and a group of 
transformations is called a strategy. In Listing 13, a simple strategy for code- 
generation time evaluation of mathematical expressions is presented. First, the two 
identifiers E and PI are replaced by their constant numerical values. Then, we look 
for constant numerical values to be summed, e.g., 2.1 + 3.1415, and replace them 
with the actual result of the addition. Note that the constant-folding transformation 
must be applied multiple times for expressions involving multiple additions. 


ivar s = new DefaultStrategy ("simple strategy") 


28 += new Transformation ("resolve constants", { 
case Identifier("PI") => RealConstant (3.1415) 
1 case Identifier("E") => RealConstant (2.7183) 


s}) 
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6S += new Transformation ("constant folding", { 
7 case Addition(a : RealConstant, b : RealConstant) 


8 => RealConstant (a + b) 
o}) 
108S . apply 


Listing 13 A strategy with two simple transformations 


By chaining and conditional execution of strategies in a fixed order, ExaSlang 
programs can be refined for specific program configurations. Depending on the 
selected code-generation parameters and input-program size, between 200 and 300 
transformations are required to generate C++ code. In Fig. 6, the trajectory of a 
program from the intermediate representation to C++ output is depicted to illustrate 
the basic approach that applies to every program. Most of the optimizations are 
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Fig. 6 The transformation trajectory of an ExaSlang program from its IR to C++ 
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Fig. 7 AST size (number of nodes) during code generation for different variants of the 3D optical 
flow application, starting at ExaSlang 4 


applied in this trajectory. Naturally, a concrete code-generation path may differ, as 
users can skip certain optimizations, or transformations may not be applied because 
the preconditions required are not satisfied. For example, it does not make sense to 
generate code for heterogeneous target devices if none exist. 

In Fig. 7, program sizes during the code-refinement process of the three- 
dimensional optical flow application (see Sect. 5.3) are depicted for a selection 
of different parallelization techniques. For all variants, we see different stages of 
the process. First, information on the program is gathered and preparations are 
done, i.e., setting array sizes and memory layouts. Next, minor code refinements are 
applied, such as the replacement of stencil convolutions with their corresponding 
computational rules. Then, depending on the parallelization technique, memory 
layouts are imposed and loops are generated, respectively modified. The execution 
times of Athariac’s code generator are usually on the same order of magnitude as an 
invocation of the target C++ compiler, i.e., within a few seconds to minutes. More 
detail on Athariac can be found in a recent issue of the Proceedings of the IEEE [76] 
and in Christian Schmitt’s dissertation [71]. 


4.3 Target-Specific Optimization 


Stencil codes have a very simple structure but they are difficult to optimize since 
the memory bandwidth usually acts as a performance brake. In project ExaStencils, 
a diverse set of optimizations has been developed to overcome this. We describe a 
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small selection briefly here. More detail on all implemented techniques, including 
those not presented here, can be found in Stefan Kronawitter’s dissertation [43]. 


Data Layout Transformations Much effort of adapting stencil codes from one 
application or execution platform to another goes into making the data layout 
fit for best efficiency. One major limiting factor is usually the available memory 
bandwidth. Examples of such adaptations are color splitting for multi-color kernels 
or switching between an array of structs (AoS) and a struct of arrays (SoA). 

ExaSlang admits flexible layout transformation directives for the specification 
of arbitrary affine transformations that are then applied automatically by the code 
generator [46]. Even though different variants of a color splitting or an SoA-to- 
AOS transformation may be most commonly used, the ExaStencils code generator 
is by no means restricted to them. Moreover, the only modification required to adapt 
the memory layout of any field is the insertion of an affine function that specifies 
the transformation. No other part of the application, including data initialization 
and communication, must be modified to make the new layout generally available. 
Explicit albeit very simple modifications to other parts are only necessary if they 
are meant to use different data arrangements. In this case, additional fields must be 
inserted for each layout and copy kernels are necessary. 

This approach avoids unnecessary changes in the source code and constitutes a 
big advance in the ease of testing and evaluating different memory layout schemes in 
order to identify the best memory layout. There are other systems that offer similar 
transformation devices but not in this generality. 


Polyhedral Code Exploration Besides a layout modification, an affine transfor- 
mation can lead to the most efficient implementation of a frequent structure in 
stencil codes: the loop nest. A popular approach to selecting such transformations 
automatically is the polyhedron model for loop optimization [22]. As for the data 
layout, the search is also here for best data locality. However, established automatic 
transformation techniques based on the PLuTo algorithm [6, 12, 13] fail to yield 
optimal results. 

Our first attempt to select better transformations was a specialized variant of the 
PLuTo algorithm available in the ExaStencils code generator. While it is capable 
of detecting very good schedules for some stencil codes, it can also encounter 
problematic ones. For example, it fails completely for RBGS kernels. Thus, we 
developed a new, optimized, multi-dimensional polyhedral search space exploration 
for the ExaStencils code generator [44] that obtains in several cases better results 
than existing approaches, such as different PLuTo variants or PolyMage [58]. It 
also has the capability of specializing the search for the domain of stencil codes, 
which reduces the exploration effort dramatically without significantly impairing 
performance. An extreme but still beneficial approach is to choose the first schedule 
selected by our specialized search without any further evaluation. This may not lead 
to the best performance but it avoids the overhead of a complete exploration—and 
the performance improvement is still satisfactory: in most experiments it was only 
a few percentage points below the best variant explored. 
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Vectorization A third optimization focuses on the vector units available in most 
processor architectures that provide single-instruction-multiple-data (SIMD) paral- 
lelism. Their use is typically mandatory for highest performance. However, each 
architecture comes with its own vector instruction set. Intel x 86 features Streaming 
SIMD Extensions (SSE) or Advanced Vector Extensions (AVX) in several different 
versions, IBM’s BlueGene/Q provides Quad Processing eXtension (QPX), and some 
ARM processors implement the Neon instruction set. Even though all of these sets 
target the same problem, their implementations differ not only in detail but also in 
key aspects, such as the way in which data can be fetched from main memory. 

As a remedy, contemporary compilers are equipped with rudimentary automatic 
vectorization capabilities most of which are, unfortunately, not very effective. On 
top, the more advanced compilers can exclude some popular architectures and be 
costly or not widely available. Since the ExaStencils code generator comes with 
its own vectorization phase [43, 76], it avoids any dependence on a special target 
compiler. Currently, it supports Intel SSE3, AVX, and AVX2, as well as IBM’s QPX 
and ARM Neon. 


4.4 Parallelization 


To parallelize ExaSlang applications automatically, mainly two concepts must be 
implemented [48]. First, data must be partitioned and distributed across the available 
compute resources and, second, data between the partitions must be synchronized 
periodically. We realize the former by splitting our computational domain into 
blocks which are further subdivided into fragments. Each fragment holds a part of 
the computational grid and all data associated with it. This hierarchical approach is 
depicted in Fig. 8 and permits an efficient mapping to different execution platforms. 
For instance, blocks can be mapped to MPI ranks while each fragment inside is 
handled by a distinct OpenMP thread. Mapping single fragments to accelerators is 
also possible, as explained later on. The synchronization of data can be controlled 
by users at layer 4 via communicate directives. They specify the field to be 
communicated and can additionally be parameterized to communicate only certain 
parts, e.g., specific ghost layers. Each communicate directive triggers the generation 
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Fig. 8 Partitioning of the computational domain in ExaStencils [48] 
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of a function that implements the corresponding behavior. This permits the reuse of 
functions, which can become quite lengthy if the same communication is reissued 
throughout an application. The function contains code specifying an MPI exchange 
for fragments residing in different blocks and simple copy kernels for fragments 
inside the same block. For the interblock case, this includes copying the required 
data to buffers, calling the appropriate MPI functions, waiting for their completion 
and copying back data received. The communication buffers required are set up 
automatically by our framework and, additionally, are shared between multiple 
communication routines where possible. This minimizes the memory footprint as 
well as the allocation and deallocation overhead. 

Kernels updating data in one fragment can additionally be parallelized. This can 
be done either with OpenMP or, if available, via an accelerator, e.g., a GPU or 
reconfigurable hardware such as an FPGA. 


Accelerators To accelerate computations with Nvidia GPUs, we require multiple 
steps during code generation and developed a back-end to emit corresponding 
CUDA? code. First, fields must be extended to manage a CPU and a GPU version. 
We also introduce flags that keep track of field data being changed on the host or 
device side at this point. Next, compute kernels are transformed to their CUDA 
counterparts and wrapped by an interface function passing variables and fields, 
used inside the kernel, as parameters. Then we replace the original kernel with 
a user-controlled condition branching to either the original CPU kernel or the 
new GPU kernel. Both variants are extended to set the flags previously described 
after execution. The same flags can also be used to control the use of added copy 
operations between host and device. This ensures that data is always synchronized 
correctly while the overhead is minimized. 

As an alternative to CUDA, SYCL is a new technology for the inclusion of 
the OpenCL ecosystem into the world of C++. It strives to combine the strengths 
of both worlds, e.g., by allowing to use custom data structures and templates 
inside computational kernels, or by providing an implementation of the parallel 
C++ Standard Template Library that can be executed not only on multicore CPUs, 
but also on accelerators. Furthermore, it aims at a reduction of OpenCL boiler- 
plate code by enabling the direct issuance of parallel_for statements instead of 
forcing the user to declare kernels, map their arguments, etc. Additionally, it detects 
automatically and initiates memory transfers between host and device. 

We had to specialize our code generation workflow slightly to address the 
characteristics of SYCL. While, in ExaSlang, the dimensionality of a kernel can 
be arbitrarily large, SYCL only supports up to three-dimensional kernels, i.e., 3D 
iteration domains. Because of this, we implemented a custom mapping to linearize 
a kernel with more than three dimensions to a single dimension while maintaining 
correct indexing. Furthermore, SYCL has currently no built-in support for reduction 
operators, such as the sum of all data items. Since multigrid methods depend on 


2CUDA is a proprietary application programming interface and de-facto standard for Nvidia GPUs. 
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reduction operators for the computation of norms used to approximate the reduction 
of the residual, we implemented our own reduction operator and used a template 
function that is called in the corresponding places. Our approach uses standard 
parallelization techniques that should perform well on CPUs and GPUs. 

SYCL separates work items into work groups in the same sense in which these 
concepts are used by CUDA and OpenCL. While this concept is not paramount, our 
code generator can optionally specify the granularity of work groups and modify 
data accesses inside the computational kernels correspondingly. 

SYCL is merely a specification, as of yet without a complete and optimized 
implementation. We verified our SYCL back-end with the commercial product 
ComputeCpp by Codeplay and also with the experimental open-source implemen- 
tation triSYCL? [33]. Although SYCL still needs to mature, it has great potential as 
a unified middleware for a multitude of devices: in addition to targeting CPUs and 
GPUs, the two major vendors of reconfigurable hardware, Xilinx and Intel/Altera, 
are actively working to accept SYCL code. This will allow us to extend the pool 
of ExaSlang-supported hardware targets without having to develop a separate code- 
generator back-end, such as the one presented in the next paragraph. 


Reconfigurable Hardware A particularly interesting target platform for scientific 
computing is reconfigurable hardware, such as in field-programmable gate arrays 
(FPGAs). They are increasingly used in high-performance computing because of 
their high throughput and energy efficiency. We started with extending an existing 
code generator for imaging processing to produce a description of a simple V-cycle 
for further processing using C/C++-based high-level synthesis via Vivado HLS for 
Xilinx FPGAs [69, 70]. This laid the base for a custom Vivado HLS back-end in 
our code generator, resulting in the ability to emit algorithm descriptions for Vivado 
HLS stemming from user input in ExaSlang [73]. To test the capabilities of the HLS 
back-end, we considered a starting grid of size 4096 x 4096 and used a V(2,2)-cycle. 
For the coarse-grid solution at a size of 32 x 32, a corresponding number of Jacobi 
smoother iterations was applied. The biggest difference to the execution on a CPU 
or GPU is that all multigrid operators were laid out spatially on the FPGA chip. 
Another essential consideration in mapping algorithms to reconfigurable hardware 
is the handling of memory and data transfers. FPGAs provide only very scarce 
amounts of on-chip memory, called block RAM, that can be accessed with full 
speed. This implies that data needs to be stored off-chip, e.g., in RAM that can be 
found on the FPGA board or even in the host computer’s memory. As a consequence, 
these transfers need to be supported by the hardware with corresponding buffers, 
which are mapped to first-in-first-out (FIFO) hardware buffers. In an FPGA, FIFO 
buffers can be either composed of registers or implemented by on-chip block RAM. 

Another important concern of designing reconfigurable hardware descriptions is 
the usage of computational resources on the chip. For our initial experiments, we 
used single-precision floating-point data types and were able to fit the complete 
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design on a mid-range Kintex 7 FPGA. On this chip, our design was able to 
outperform an Intel 17-3770 in terms of throughput by a factor of approximately 
3. Switching from single to double precision does not change any performance 
aspects but merely requires more chip resources. As a result, we had to switch to the 
larger Virtex 7 FPGA to house the design but were able to retain the performance 
advantage. 

Based on these results, we worked on incorporating a more relevant coarse- 
grid solution algorithm into our hardware designs. By a clever arrangement of the 
required memory buffers and transfers, we were able to implement a conjugate- 
gradient solver and map it to the Virtex 7 using double-precision floating-point 
numbers [77]. This algorithm is usually not well-suited for FPGAs because of its 
random memory-access patterns that break the preferable pipelining model. As a 
consequence, data needs to be stored on-chip in block RAM to retain performance. 
However, this takes away resources required by other multigrid components, e.g., 
smoothers and inter-grid operators. We conducted several experiments and were 
able to outperform an Intel Xeon E5-1620 v3 in many cases [77]. As an opti- 
mization, we overlapped successive V-cycles, i.e., we were able to reuse multigrid 
operators on the FPGA when they were no longer required by the previous V-cycle 
iteration provided the data dependences were respected. Theoretically, this should 
nearly double the throughput, which was confirmed in our numerical experiments. 


4.5 Compositional Optimization 


The ExaStencils framework enables the automatic generation of geometric multigrid 
solvers from a high-level representation either by using the generate solver interface 
or by means of a direct specification of its algorithmic components at layer 3. How- 
ever, in many cases, the construction of an efficient and scalable multigrid method is 
a difficult task requiring extensive knowledge in numerical mathematics. Therefore, 
it is typically performed manually by a domain expert. Part of ExaStencils’ vision is 
to automate all steps from the specification of a problem in the continuous domain 
(layer 1) to the generation of an efficient implementation on the target platform. To 
achieve this goal, we construct geometric multigrid methods for solving systems 
of linear equations as program optimization tasks. We have developed a context- 
free grammar for the automatic construction of solver instances from a given 
set of algorithmic options, whose production rules are shown in Fig. 9 for the 
case of pointwise smoothers. Each rule defines the list of expressions by which 
a certain production symbol, denoted (e), can be replaced. To generate a multigrid 
expression, starting with the symbol (S), this process is repeated recursively until the 
produced expression contains only terminal symbols or the empty string à. A more 
detailed description, including the semantics of the expressions generated following 
this grammar, can be found elsewhere [78]. 

The effectiveness of an iterative method is determined by two objectives: its rate 
of convergence and its compute performance on the target platform. To estimate 
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Fig. 9 A formal grammar for the generation of multigrid solvers 


both objectives for a given multigrid solver expression in reasonable time, we 
employ automated local Fourier analysis (LFA), as described in Sect. 2.2, for the 
first and a roofline performance model [89] for the second objective. Our goal is 
to obtain the set of Pareto-optimal solvers with respect to both objectives. A direct 
enumeration and evaluation of all possible configurations leads to an exponential 
operation increase with the number of multigrid levels. Thus, a brute-force search 
is infeasible in most cases. As a remedy, we employ evolutionary computation, 
a family of stochastic optimization methods stemming from the field of artificial 
intelligence and inspired by the principle of biological evolution. These methods 
evolve a population of candidate solutions, which are iteratively improved through 
the use of so-called genetic operators. The operators are specifically designed for 
the given problem representation, in our case the set of expressions that can be 
generated according to the grammar described in Fig. 9. After the optimization 
is completed, we transform the expression of each Pareto-optimal solver to an 
algorithmic representation which is emitted in the form of ExaSlang 3 code that can 
then serve to evaluate the solver’s rate of convergence and compute performance on 
the target hardware. As a first step, we have implemented our optimization approach 
in Python,4 using the libraries LFA Lab [65] and DEAP [23]. We point the interested 
reader to preliminary results of the optimization of multigrid solvers for the steady- 
state heat equation on a multi-core CPU [78]. 

In the future, we intend to integrate this implementation into the generate solver 
interface for the automatic construction of efficient and scalable geometric multigrid 
solvers based on a given specification. 


*https://github.com/jonas-schmitt/evostencils. 
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4.6 Feature-Based Domain-Specific Optimization 


One new concept that ExaStencils introduced into high-performance computing is 
that of feature-based domain-specific optimization [5]. The central idea is to look at 
a source program (e.g., a stencil code or application) not as an isolated individual 
but as a member of a program family or software product line and to specify it by its 
commonalities and variabilities with respect to the other family members in terms 
of features. A feature represents a domain concept (e.g., a type of smoother or grid) 
that may be selected and combined with others on demand. With this approach, a 
wide search space of configuration choices can be reviewed automatically at the 
level of domain concepts and the most performant choices for the application and 
execution platform at hand can be identified. 

Features and configuration options can have a significant influence on the 
performance of the generated code. The influences of some might already be 
known to the developer of the system, but other influences may be opaque and 
likewise may be influences that arise from interactions among features, called 
feature interactions. Also, the influences that one may expect based on theory and 
domain knowledge may not match the actual influences in the program, which often 
depend on implementation details. 

Ideally, a developer or user knows the optimal choices for all features. The goal 
must then be to exploit domain knowledge to identify the most performant con- 
figuration. But since, as just explained, domain knowledge is unevenly developed 
and remains often incomplete, ExaStencils resorts to machine learning to derive a 
feature-specific performance model [80] that provides a comprehensible description 
of the effects of features and feature interactions on performance. The machine- 
learning techniques employ a set of configurations, the learning set or sample set, 
as input to learn from. 

To this end, we have developed a framework of configuration sampling and 
machine learning approaches [36, 39, 59, 80] that allow us to derive a performance 
model of a given code that is parameterized in terms of its features. This way, we 
can express performance behavior in terms of concepts of the domain and determine 
automatically optimal configurations that are tailored to the problem at hand, which 
we have demonstrated in the domain of stencil codes [29, 30, 32] and beyond (e.g., 
databases, video encoders, compilers, and compression tools) [36, 39, 80]. Our 
framework integrates well with the other parts of ExaStencils that use and gather 
domain and configuration knowledge in different phases. It is similar in spirit to 
the performance modeling tool Extra-P, developed in the SPPEXA projects Catwalk 
und ExtraPeak [14], though we concentrate on flexibility in choosing and combining 
different sampling and learning techniques. 

To demonstrate the usefulness of the machine-learning approach, we performed 
experiments on a large number of configurable software systems from different 
domains, among them different multigrid solvers including an implementation in 
ExaSlang [29, 30, 32]. To settle on a suitable learning set, we analyze the influence 


ExaStencils: Advanced Multigrid Solver Generation 433 


of different sampling strategies on the accuracy of the performance-influence 
models [80]. 

Performance-influence models are not only meant to determine optimal configu- 
ration but also to validate and refine domain knowledge. We must be sure that the 
observed influences on the performance of the system match with our postulated 
knowledge, otherwise the system or the knowledge base must be revised. In a case 
study of a multigrid system working on triangular meshes, albeit not written in 
ExaSlang, we were able to validate the domain knowledge of the developers of the 
system [32]. 

For complex systems, learning a performance-influence model that captures 
accurately the performance behavior of all individual configurations is often time- 
consuming and the resulting models can become very large (i.e., many terms 
describing only a small influence on the performance). However, it is not always 
necessary to learn such a complex model; it might be beneficial to learn faster a 
simpler model that is less accurate but accurate enough for a given use-case [39]. 
In a set of experiments, we have demonstrated the tradeoff between accuracy, 
complexity, and time to learn a performance-influence model and shown that even 
simple models can predict all configurations with high accuracy. Depending on the 
use case of the performance-influence model, it can be of greater value to get quickly 
a rough impression of the most relevant influences of configuration options rather 
than waiting for a long time for a more detailed but also more complex model. 


5 Case Studies 


An essential ingredient of ExaStencils was a collection of case studies to test 
our approach for performance, target performance, and versatility. The more 
illustrative ones are sketched in this section. Section 5.1 deals with scalar elliptic 
PDEs, Sect. 5.2 with image processing applications and Sect. 5.3 with aspects of 
computational fluid dynamics. Section 5.4 details how generated solvers can be 
coupled with existing code, in our case molecular dynamics simulations. Finally, 
Sect. 5.5 goes beyond code generation and discusses a design decision study of 
porous media applications. 


5.1 Scalar Elliptic Partial Differential Equations 


A prominent and well-researched PDEs example that describes the steady state of 
the distribution of a physical quantity u : R? —> R, such as heat in a solid medium, 
is 


-V-(aVu)=f in Q, 
u=g on dQ 
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for a given domain Q of dimensionality d, suitable boundary conditions, right-hand 
side f : R? > R and a thermal conductivity a : R? — R of the material to be 
simulated. Assuming a finite-difference discretization on a uniform grid, it can be 
observed that a constant a leads to a stencil with constant coefficients, whereas a 
variable a leads to one with variable coefficients. In the latter case, we always store 
all stencil coefficients, rather than values of a, unless specified otherwise. Lastly, if 
a is 1, the equation simplifies to the Poisson equation introduced in Sect. 3.2, which 
is given by —V7u = —Au = f. 

In previous publications [43, 46, 48, 72], we demonstrated our ability to generate 
highly optimized and massively parallel geometric multigrid solvers for this model 
problem. For example, we achieve weak scalability up to the full JUQUEEN 
supercomputer, i.e., using up to 458,752 cores across 28,867 nodes, for moderately 
sized problems of 1—16 million unknowns per core. This is a good result considering 
that the ratio of data to be communicated, e.g., via MPI, is quite high compared to 
the required amount of computation. Figure 10 summarizes the obtained time to 
solution for up to 7.3 - 10!* unknowns using a hybrid MPI/OpenMP parallelization 
for which one V(3, 3)-cycle takes about 3.6 s. A similar scaling behavior can also be 
observed in a scaling experiment on Piz Daint (see Fig. 11). With a largest problem 
of 7.3 - 10!2 unknowns solved, the ExaStencils software reaches for the Poisson 
equation a size comparable to the finite-element approach [27] developed in the 
TerraNeo project [8], where the solution of a stabilized tetrahedral linear finite- 
element discretization of a Stokes system is demonstrated with 1.7- 101? unknowns. 
This is based on a Uzawa-type smoother [20] for the Stokes system. A first direct 
comparison has been made recently [41]. 

In a further single-node case study, we compared ExaStencils against the High- 
Performance Geometric Multigrid (HPGMG) benchmark [2]. Here, on one core of 
an Intel Xeon E5-2630 v2, the vectorized code generated by Athariac (see Sects. 4.2 
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Fig. 10 Weak scaling for generated multigrid solvers using V(3, 3)-cycles to solve for Poisson’s 
equation in 3D on JUQUEEN [47, 48]. The largest problem solved consists of 7.3- 10!? unknowns 
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Fig. 11 Weak scaling on Piz Daint for constant coefficients (CC) and variable coefficients (VC). 
In all cases, the number of iterations required is nearly constant. Performance differences between 
CPU and GPU reflect the hardware characteristics 


and 4.3) was able to solve 14.6 times as many unknowns per second as HPGMG. A 
closer inspection, using LIKWID [83], revealed that this performance gain is mainly 
due to a higher MFLOPS rate and a much better memory bandwidth exploitation 
of the solvers generated by ExaStencils. Further details, including also software 
productivity metrics (e.g., lines of code and Halstead complexity measures), are 
available elsewhere [76]. As an example, see also our case study on molecular 
dynamics (Sect. 5.4). 


5.2 Image Processing 


Next, we consider applications in variational image processing. State-of-the-art 
denoising algorithms based on total generalized variation have been implemented in 
ExaSlang [26] using a preconditioned Douglas-Rachford iteration for the underlying 
saddle-point problem. Results show a speedup of more than 4 compared to a 
reference Matlab implementation on CPU. As an indication of the capability of 
solving systems of PDEs as well, our most prominent imaging case study is the 
implementation of optical flow solvers. 

Computation of the optical flow refers to the approximation of the apparent 
motion between two or more images which are part of an image sequence J. We 
used this application to illustrate ExaSlang’s higher-dimensional data types [74]. 

Let us assume that J (x, y, t) refers to a specific pixel (x, y) in an image sequence 
at time t, where t could refer to a certain frame in a video stream. Furthermore, we 
assume that a moving object’s intensity does not change over time. Then, for small 
movements (corresponding to small time differences between two images), we may 
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describe the movement of an intensity value at a pixel (x, y, t) as follows: 
I(x, y, t) = I(x +dx, y+dy,t+dt) 


After performing a Taylor expansion and reordering factors, we can define the 
partial image derivatives Iy = at Iy = ut and I; = a This leads to the temporal 


gradient Vel = (Iy, Ly, 1)? and the optical flow vector (u, v) = (2 g). which 


can be transformed to the following system of PDEs to be solved: 


—aAu+t+ I,(kut+ fv) = —1; l; 
—aAv+ Ty (xu + Iv) = TI; 


Here, a denotes the diffusion coefficient, which we set to 1 in this example. 
Furthermore, we set the time gradient J; also to 1 for simplification purposes. After 
discretization, we obtain the following stencil: 


—a 
ge 
—a 4a +I kly —a 
( a) I, Ty 4a + I? ( a) 


This stencil can be directly mapped to Exaslang and used in a smoother kernel, as 
illustrated in Listing 14. 

All corresponding variables, such as the unknowns u and v, are represented in 
ExaSlang as vectors of corresponding dimensionality. Thanks to our code generation 
approach, we can easily conduct parameter studies. For example, we varied the 
number of pre- and post-smoothing steps between 0 and 6 to study their influence 
on the convergence rate (see Christian Schmitt’s dissertation [71] for details). 
Furthermore, we switched vectorization for Intel’s AVX2 instruction set on and 
off. For our measurements, we used a single compute node equipped with an Intel 
17-6700 CPU. We varied the number of threads between one (1T) and eight (8T) 
and used LIKWID [83] for our measurements, which is a supported back-end for 
ExaSlang’s profiling capabilities. 

A comparison of the energy consumption in relation to the application’s exe- 
cution time is depicted in Fig. 12. Clearly, the application is limited by the 
available memory bandwidth, as a speedup can be observed when going from one 
to two threads, but additional usage of CPU cores does not improve the speedup 
further. The use of AVX2 yields a small performance improvement compared to 
the scalar variants and, consequently, results in lower energy consumption. The V- 
cycle configurations that performed best in this case study are the V(4,2)-cycle, the 
V(3,3)-cycle, and the V(2,4)-cycle. 
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iloop over fragments { 
2loop over Flow@current { 
Flow [next] @current = Flow[active]@current + ( 
( inverse ( diag ( SmootherStencil@current ) ) ) + 
( RHS@current - 


6 SmootherStencil@current * Flow[active]@current ) 


) 
s}} 


gadvance Flow@current 


Listing 14 Jacobi smoother definition using slots for the flow field 
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Fig. 12 Parameter study of different parallelization approaches and V-cycle configurations for the 
two-dimensional optical flow application 


5.3 Computational Fluid Dynamics 


Our next area of application is computational fluid dynamics (CFD), where we 
considered the solution of Stokes and Navier-Stokes equations [47, 50]. 

Let us turn first to the Stokes equations as introduced in Sect. 2.3. We dis- 
cretized the given linear system of PDEs using finite differences and volumes on 
staggered grids. Additionally, more advanced multigrid components are required. 
This includes overlapping block smoothers that may additionally be colored. In 
ExaSlang, both can be expressed concisely and intuitively, as demonstrated in 
Sect. 4.1. Choosing a suitable coarse-grid solver is also a highly efficient process in 
our framework as it can be configured via our generate solver directive. Currently, 
we support conjugate gradient (CG), conjugate residual (CR), biconjugate gradient 
stabilized (BiCGSTAB) and minimal residual (MINRES), all with optional restart 
mechanisms, as well as simply applying the smoother. Of course, implementing 
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one’s own solver at layer 3 or layer 4 is still possible. We found that, in the collection 
of variants above, a BiCGSTAB with an added restart mechanism works reasonably 
well for the present test case [47]. 

Extending this work towards the full Navier-Stokes equations is the next step. 
For incompressible fluids, they can be given as 


ðu 


1 
+(u-V)u—vV*u+—Vp= fa 
ot p 


V-u=0, 


on a given domain Q, which is square in 2D and cubic in 3D, with suitable boundary 
conditions and with u, p and fy as before, p as the density and v as the kinematic 
viscosity given by the ratio of the dynamic viscosity u and p. The switch from 
Stokes to Navier-Stokes raises two challenges. First, since the equations become 
time-dependent, a suitable time integration is required. We choose a traditional 
implicit Euler, which can be implemented easily at layer 3 or layer 4. For each time 
step, one can then call either a generated solver variant or a custom implementation. 
Second, the employed solver has to be able to deal with non-linear equations. To this 
end, we allow expressing suitable linearizations, such as those based on Newton’s 
method and Picard iterations [21, 47], at layer 2. The linearization chosen can be 
applied either locally or globally, and both variants can be expressed in tandem 
with the discretization approach at layer 2. Then, non-linear multigrid solvers based 
on the full approximation scheme can be tailored at layer 3 automatically. Scaling 
tests on the JUWELS supercomputer exhibit good strong scalability for 2.7 - 108 
unknowns and good weak scalability on up to 24,576 cores, the largest amount 
available to us at the time of the experiment [47]. The results are summarized in 
Figs. 13 and 14. 

This approach can also be extended to the scope of our most sophisticated 
application, namely the simulation of non-Newtonian and non-isothermal fluids. 
Originally, we ported a legacy FORTRAN code to ExaSlang 4 [50] resulting not 
only in a considerable reduction in code complexity and size of about one order of 
magnitude, but also in a reasonable gain in performance of also about one order 
of magnitude on average [50]. One prerequisite is the support of staggered non- 
uniform grids by our code generation framework and DSL, which we extended 
accordingly. After implementing fully parallel grid setup routines, we are now also 
able to generate applications executable on GPUs and cluster architectures from 
the same ExaSlang code. This application also serves as a motivation to extend 
our generator to apply domain-specific optimizations, such as our sophisticated 
loop-carried common subexpression elimination [45], which are able to improve 
performance even further. Finally, using the previously described layer 2 and layer 3 
capabilities allows for even more expressive and concise specifications, which have 
moreover a greater ability to reflect concepts familiar to domain scientists. 

Lastly, we should mention that we invested some time to explore alternative 
approaches, such as the Lattice-Boltzmann method, which can also be implemented 
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Fig. 13 Weak scaling of generated multigrid solvers for the Navier-Stokes equations in 2D and 
3D on JUWELS [47] 
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Fig. 14 Strong scaling of generated multigrid solvers for the Navier-Stokes equations in 2D and 
3D on JUWELS [47] 


in our DSL [64]. An early evaluation showed that applications generated with the 
ExaStencils framework are about a factor of 2 slower compared to the state-of-the- 
art multiphysics framework waLBerla.° 


5 www.walberla.net. 
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5.4 Molecular Dynamics Simulation 


To illustrate the versatility of the ExaStencils approach, we demonstrated that the 
ExaStencils compiler could be used to generate a multigrid solver that integrates 
into a large, existing simulation code. For this purpose, we considered a molecular 
dynamics simulation code used in practice [67, 68], a simulation that computes the 
motion of the electrons and nuclei of molecules or groups of molecules. 

The simulation is a so-called real-space density functional theory (RSDFT) 
simulation. A simulation in real space computes the wave functions on a discretized 
Cartesian grid, which avoids costly computations of the fast Fourier transform 
(FFT) that methods which work in Fourier space have to perform. The simulation 
computes the motion of the particles using the Kohn-Sham density functional 
theory, which is implemented using a Car-Parrinello molecular dynamics scheme. 

The RSDFT simulation requires the computation of the electron-electron 
Coulomb interaction potential of the charge density. This potential is computed 
efficiently by solving the corresponding Poisson equation using a multigrid method. 
The RSDFT code contains a manually derived multigrid solver, which we wanted 
to replace by a solver generated by the ExaStencils compiler. 

However, integrating a solver into an existing code base requires some additional 
steps above the ones necessary to generate a stand-alone solver. The RSDFT code 
has its own set of grid data structures for storage of the wave functions. To obtain 
optimal performance, the ExaStencils compiler generates a set of internal data 
structures that are the most suitable for the computation (see Sect. 4.3). These data 
structures are usually not compatible with the ones an application code uses. While it 
would be possible to force the compiler to use the RSDFT data structures, it would 
deprive the compiler of some of its optimization opportunities. As a workaround, 
we used the compiler to generate a set of data transfer routines that copy the grid 
functions from the RSDFT data structures to the internal ones and vice versa. 

In Table 1, we compare the performance of the automatically generated multigrid 
solver and the legacy solver that was manually written in C. Note that both solvers 
implement different coarse-grid solution strategies. Thus, the overall timings are 
not directly comparable, because the number of iterations change between the 
implementations. However, the time per iteration is indicative for the quality of 
the code of the generated solver, and we see that it is comparable to the time per 
iteration of the legacy code. The generated code is often slightly faster. 

We can conclude that the ExaStencils approach is also feasible in situations that 
involve an already existing codebase. 


5.5 Porous Media 


A further study, in cooperation with SPPEXA project EXA-DUNE [7], concerned 
the design decision made for stencil codes in the domain of porous media [31]. 
The goal of this study was to help application engineers understand the complexity 
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Table 1 Timings of the solution of the Poisson equation discretized on a grid of n points on 
JURECA [42] using the legacy solver and an automatically generated multigrid solver 


ExaStencils 


[e Joats [iaxiots [6 oms 
[e josis [ssxio?s [6 (098 
e joats |sixio?s [6 [027s 
[e foars |aaxto?s [6 [os | 


[e pzas [iiss fe feos 
bs_|173s_Josis fs fons 
[s Toms Joass Je Niers 
2 hss [020s |e ios 
[2e asss Joas fe Toss 
a fs Joos |6 [oms | 


of stencil computations and the relation between the mathematical model of an 
application and the stencil used to solve the partial differential equation underlying 
the application. To model these design decisions, we use feature orientation as 
introduced in Sect. 4.6. An example of a design decision is the choice of the type of 
boundary conditions that have to be satisfied by a given application. We differentiate 
between decisions at the mathematical and the stencil level, and we devise one 
feature model for each of these levels. Since decisions at the mathematical level 
may affect decisions that can be made at the stencil level, we also include constraints 
between these two levels. 

To demonstrate the usefulness of feature modeling, we considered a set of 
applications that deviate from the standard 5-point stencil used in many benchmarks. 
We demonstrated that feature models can be used beyond simple mathematical 
problems by considering problems of different complexity. Specifically, we started 
with the simple heat equation and went on to more complex equations, such as the 
advection-diffusion equation with operator splitting or Richard’s equation. For all 
of these equations, we were able to express the design decisions at the mathematical 
and the stencil level as well as their dependences. The result is a comprehensive 
set of feature models that can be uses to formally reason about the variabilities in 
stencil codes (e.g., to identify dependencies among configuration decisions). 
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6 Variants of the ExaStencils Approach 


6.1 ExaSlang 4 Embedded, Not External 


In the second funding period, ExaStencils was part of an international collaboration 
with Shigeru Chiba, co-funded by the Japan Science and Technology Agency (JST). 
This project has already been reported to and reviewed by JST as part of the 
Japanese strategic basic research programme CREST [17]. The summary is also 
available at the author’s Web site. 

The project aimed at a software architecture for embedded domain-specific 
languages. The cooperation of the DFG and JST in SPPEXA gave Chiba’s team 
the opportunity to explore the applicability of its software architecture in the 
context of ExaStencils. ExaSlang 4 is a stand-alone language that needs its own 
compiler and development environments, such as an editor and a debugger, and 
that has a relatively large development cost. The principle of embedding is an 
approach to reduce the size and cost of the infrastructure needed. An embedded 
DSL is implemented on top of a general-purpose host language as a library with 
a programming interface that looks like a language. Since an embedded DSL is 
a library, its development cost is usually smaller and programmers can reuse the 
environments of the host language. 

The result of the project was an embedded-DSL architecture based on run-time 
metaprogramming. In this architecture, a part of the running program is reified at run 
time; that is, its abstract syntax tree is dynamically extracted. Then it is translated to 
efficient binary code to be run. For example, the host language could be Java, while 
the program is translated to C/C++ and compiled to machine code. The idea is to 
view the extracted code as a program not in the host language, but in the DSL with 
a slightly different semantics. Hence, we can apply domain-specific optimization, 
such as partial evaluation, during translation. 

To evaluate this architecture, Chiba’s team developed two software systems. 
One is Bytespresso [16], which they used to implement an embedded DSL that 
mimics Exaslang 4. They were successful in developing the DSL with a similar 
programming interface and then measured the execution performance. Although 
the computational part alone experienced a loss by a factor of 3, the total execution 
including compilation was twice as fast in the largest case measured: 12 grid levels,’ 

which is still small-scale. This demonstrates that, for small-scale problems, the 
embedded version of Exaslang 4 is more suitable for interactive execution than the 
external version. 


Shttps://post-peta-crest.github.io/chiba/. 
7Figures of the measurements can be found in the CREST report [17]. 
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iFragments.loop_ over do 
Flow_u.current.loop_over do 
Flow_u[:next] .current = Flow_u[:active] .current + 
(oO / Clie; (SmootherStencil u.current)) + 


(RHS_u.current - SmootherStencil_u.current * 
Flow_u[:active] .current) ) 


Listing 15 Specification of an optical flow solver in Ruby-embedded ExaSlang 4 


Chiba’s team also explored the expressive power of the architecture it proposed 
in Yadriggy [15], a framework for embedding DSLs in Ruby. Compared to the 
ExaSlang 4 implementation in Java using Bytespresso, exhibiting a source code 
syntax far from the original, the Ruby-embedded version on Yadriggy features a 
syntax much closer to the original. This is due to the flexible syntax of Ruby and 
the development support by Yadriggy. For example, the code fragment depicted in 
Listing 15 is ExaSlang 4 code taken from the optical flow case study (Sect. 5.2, 
Listing 14) embedded in Ruby. This internal DSL code is not interpreted as Ruby 
code and, hence, the language does not have to adhere to Ruby’s semantics. 


6.2 A Multigrid Solver in SPIRAL 


Forerunner and motivator for the vision of ExaStencils was the US project SPIRAL 
(24, 61]. SPIRAL’s initial and foremost domain has been linear transformations. 
Recently it went on to small-scale linear algebra [81]. ExaStencils’ domain is a 
subdomain of multigrid computations. At project halftime, we put a very simple 
case of multigrid on SPIRAL: a solver with a Richardson smoother for a discretized 
square 2D Poisson equation with Dirichlet boundary conditions [11]. The central 
step is to bring the smoother into an algebraic form of about a dozen rewrite 
equations. It was an effort of a few days, but the present implementation of 
SPIRAL does not support the broader domain of ExaStencils. For the generation of 
efficient code, special adaptations are required as illustrated in the Bachelor thesis 
of Sebastian Schweikl [79]. This is a consequence of the fact that SPIRAL’s target 
domain has been signal processing and not the solution of PDEs. The applicability 
of SPIRAL in its present form to our domain is very limited. 
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7 The Legacy of ExaStencils 


7.1 Effort 


ExaStencils was not the only project in SPPEXA to address the solution of PDEs 
via multigrid methods. However, it was the only one that took a revolutionary rather 
than evolutionary software approach, i.e., that did not build on existing software 
tools or application software but started afresh with the automatic, domain-specific 
synthesis and optimization of application software via a set of dedicated domain- 
aware software tools. The challenge was to identify the domain knowledge that is 
useful in helping an optimization, to build the domain-aware tools and to generate 
some target codes that demonstrate that this approach is realistic and promising. 

A central ingredient of the approach is the stratification of the DSL ExaSlang into 
four layers of abstraction, each addressing a different set of aspects of the stencil 
code. The domain knowledge is twofold. Knowledge of the application problem is 
provided by hints that enable the ExaSlang compiler to optimize at the next-more 
concrete layer. These hints can be supplied as code annotations or conceptually by 
other means, e.g., configuration files or GUIs, as is the case in the ExaSlang Level 1 
editor documented in Tim Ammenhiuser’s Bachelor’s thesis [4]. Knowledge of the 
execution platform is provided by a dedicated platform description language. 

The main effort in the project concerned the design and implementation of the 
ExaSlang compiler and code generator which had to be built from scratch, since 
ExaSlang is an external DSL. This effort led to three dissertations of the six that 
emerged from the project [43, 47, 71] and two habilitation theses [35, 40]. In a 
sideline, a comparative study was made with an internal DSL mimicking the most 
concrete layer of the external ExaSlang, which proved the approach doable but not 
the first choice for exascale software. 

Another major effort was to drive the development of performance prediction 
further. On the software side, this will soon be realized by one cumulative dis- 
sertation introducing the, also revolutionary, generation of performance prediction 
models via machine learning in the setting of software product line engineering 
[28]. On the mathematics side, a dissertation drove local Fourier analysis for the 
convergence prediction of multigrid codes further [66]. Another dissertation in the 
realm of multigrid math drove the technology of multigrid smoothers forward [19]. 

The final major element was the case studies conducted to illustrate the impact 
of the ExaStencils approach. 


7.2 Outreach 


Let us sketch how ExaStencils profited from and nurtured other research projects. 
There was a bilateral exchange of ideas and best practices with SPPEXA 
project TerraNeo [8] that deals with multigrid solvers on block-structured grids 


ExaStencils: Advanced Multigrid Solver Generation 445 


for applications in geoscience. Both ExaStencils and TerraNeo have dealt in loose 
collaboration with high-performance multigrid codes. One result was an evaluation 
from the perspective of code generation technology [41]. A further systematic 
comparison of the performance regarding the algorithms, data structures and 
their implementation is left to the future. However, the generative programming 
technologies of ExaStencils are already being leveraged for the redesign of the 
TerraNeo code basis [38], where they help to simplify the software structure and 
to reduce the coding effort. 

In cooperation with SPPEXA project EXA-DUNE [7], we developed a semi- 
automatic variability extraction approach that generates a family of applications 
based on a given application [31]. An ExaStencils-generated application solves the 
same problem as the EXA-DUNE application but uses sets of methods different 
from the DUNE framework [9] with the goal of optimizing performance. For exam- 
ple, applications might use different grid implementations or different finite-element 
maps provided by the DUNE framework. The goal of this approach is to ease 
the burden of having to understand the DUNE framework when adapting a given 
application to new use cases and optimizing performance. In a first evaluation, the 
approach was able to identify over 90% of the alternative but compatible methods a 
developer of the framework has identified. Using this automated approach, we were 
even able to identify a bug and an inconsistency in the DUNE framework. 

A molecular dynamics simulation on the application platform RSDFT [67] 
profited from the ExaStencils approach. The port of a development in ExaSlang 
to RSDFT was successful. 

Applied researchers in triangular meshes profited from the ExaStencils 
approach of validating and refining domain knowledge via a performance-influence 
model [32]. In simpler words, an automatic learning procedure confirmed their 
design choices and their effects on performance and made them aware of others. 

In a technically related BMBF-funded project, called HighPerMeshes, which 
also aims at creating a DSL but for the scientific-computing domain of unstructured 
meshes [37], the parallelization technology SYCL was evaluated [3]. This fueled our 
interest in creating a SYCL back-end for our code generator which, in turn, provided 
valuable insights into possible code generation strategies and optimizations for 
project HighPerMeshes. 

SPIRAL [24, 61] was ExaStencils’ main motivator and an orientation point for 
positioning its contribution [11]. At present, SPIRAL is not as serious a platform 
for multigrid solver development as ExaStencils and it would support the algebraic 
rather than the geometric form of multigrid. 

Similarly, HIPA® [55, 57], a DSL embedded in C++ and a source-to-source 
translator for image processing applications that can target a wide variety of 
parallel accelerator architectures, inspired us on how to deal with both domain and 
architecture knowledge. In turn, at the start of ExaStencils, we introduced language 
constructs to HIPA® to process and represent data at different resolutions, which 
enables the specification of applications that work on image pyramids as well as 2D 
stencil-based multigrid methods. By decoupling the algorithm from its schedule, 
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HIPA“ allows the generation of efficient stencil-code implementations for single 
accelerators [56]. 

A recent DFG-funded project on ocean modeling (grant. no 320126236) chal- 
lenges the extensibility of ExaStencils’ application domain by considering time- 
dependent hyperbolic partial differential equations, discretized using finite-volume 
or higher-order discontinuous Galerkin methods on block-structured triangular 
grids. Since, so far, an explicit time-stepping scheme is being applied, no solvers 
like multigrid are required. However, it has been shown already that the ExaStencils 
framework is capable of generating efficient and scalable code for such applications 
on GPU clusters [49]. 


7.3 Potential 


The ExaSlang compiler and code generator remain at the stage of a prototype, but 
we believe that they have demonstrated that a stratified optimizing DSL approach 
can provide dramatically increased convenience for the application programmer and 
high flexibility concerning the execution platform. The effort of adapting a complex 
code to a different application or platform can be reduced significantly. The price 
to be paid is good knowledge of the theory on which the application rests and a 
significant development cost in implementing the DSL. The latter can be reduced by 
going the internal rather than the external way, at the price of reduced expressiveness 
and flexibility in terms of target code optimization. 

Our choice not to go with an embedded DSL tends to be not popular in present- 
day DSL development but is essential. We did not want to be bound by any language 
limitations in exploring the potential of the ExaStencils approach to domain- 
specific programming and optimization. To achieve high portability and execution 
performance of generated programs, we selected C/C++ as the target language. 

Our case studies covered the range from textbook examples to realistic appli- 
cations. The latter are represented by a simulation of non-Newtonian flow, fully 
developed in the ExaSlang world, and a molecular dynamics simulation which used 
an ExaStencils solver through an automatically generated compatibility interface. 
We have demonstrated that we can reach the expected performance, achieve porta- 
bility to most contemporary HPC platforms, and profit from a substantial decrease 
of development time of new PDE models that can be expressed in ExaSlang. 

Since the start of our project, code generation has become a standard technique 
in many HPC codes and, thus, the concepts developed in ExaStencils are used or 
can be adapted to other domains. Currently, a popular approach is the generation of 
single compute kernels from an embedded DSL for existing software frameworks. 
In addition, C++ template-based embedded DSLs are common. Frequently, both the 
concrete DSL and the resulting implementations are very application-specific, but 
the intermediate representations and code transforms can be defined quite generally 
and then optimized for the specific case. In the future, it should be possible to 
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extend the ExaStencils approach to other HPC applications to enable more holistic 
optimizations than currently established approaches. 
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Abstract Performance models are powerful tools allowing developers to under- 
stand the behavior of their applications, and empower them to address performance 
issues already during the design or prototyping phase. Unfortunately, the difficulties 
of creating such models manually and the effort involved render performance 
modeling a topic limited to a relatively small community of experts. This article 
summarizes the results of the two projects Catwalk, which aimed to create tools 
that automate key activities of the performance modeling process, and ExtraPeak, 
which built upon the results of Catwalk and worked toward making this powerful 
methodology more flexible, streamlined and easy to use. The sew projects both 
provide accessible tools and methods that bring performance modeling to a wider 
audience of HPC application developers. Since its outcome represents the final state 
of the two projects, we expand to a greater extent on the results of ExtraPeak. 


1 Introduction 


High-performance computing (HPC) is a key technology of the twenty-first century. 
Numerous application examples, ranging from the improved understanding of 
matter to the discovery of new materials and from the study of biological processes 
to the analysis of social networks, give evidence of its tremendous potential. Mastery 
of this technology will decide not only on the economic competitiveness of a society 
but will ultimately influence everything that depends on it, including the society’s 
welfare and stability. Moreover, there is broad consensus that high-performance 
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computing is indispensable to address major global challenges of humankind such 
as climate change and energy consumption. However, the demand for computing 
power needed to solve problems of such enormous complexity is almost insatiable. 
In their effort to answer this demand, supercomputer vendors work alongside 
computing centers to find good compromises between technical requirements, tight 
procurement and energy budgets, and market forces that dictate the prices of key 
components. The results are sophisticated architectures that combine unprecedented 
numbers of processor cores into a single coherent system, leveraging commodity 
parts or at least their designs to lower the costs where in agreement with design 
objectives. 

Exploiting the full power of HPC systems has always been hard and is becoming 
even harder as the complexity and size of systems and applications continues to 
grow. On the other hand, already today the savings potential in terms of energy and 
CPU hours that application optimization can achieve is enormous [5]. As the number 
of available cores increases at tremendous speed, reaping this potential is becoming 
an economic and scientific obligation. For example, an exascale system with a power 
consumption of 20 MW (very optimistic estimate) and 5000h of operation per year 
would—assuming an energy price of 0.1€ per kWh—produce an energy bill of 
10 M€ per year. 

Ever-growing application complexity across all domains, including but not lim- 
ited to theoretical physics, fluid dynamics, or climate research, requires a continuous 
focus on performance to productively use the large-scale machines that are being 
procured. However, designing such large applications is a complex task demanding 
foresight since they require large time investments in development and verification 
and are therefore meant to be used for decades. Thus, it is important that the 
applications be efficient and potential bottlenecks are identified early in their design 
as well as throughout their whole life cycle. Continuous performance analysis 
starting in early stages of the development process is therefore an indispensable 
prerequisite to ensure early and sustained productivity. 

Tuning an application means finding the sweet spot in its combined design 
and configuration space. Unfortunately, the sheer size of this space renders its 
exhaustive traversal via performance experiments prohibitive. In the absence of 
alternatives, many developers still rely on experiments, trying only a small and not 
necessarily representative subset of the available design and configuration options. 
Because of their limited view, they more than often overlook valuable optimization 
opportunities or miss latent performance limitations whose underlying trend they 
did not capture. 

Performance models, in contrast, allow the design space to be explored much 
faster and much more thoroughly. Although often based on simplifying assump- 
tions, they offer tremendous insight at the small cost of evaluating a formula. 
A model can be easily used to balance important trade-offs and adjust design 
parameters such that close to optimal performance is achieved. Such models 
allow problems in applications to be detected early on, long before they manifest 
themselves in production runs, and their severity to be determined when the cost of 
eliminating the problems is comparably small. If the problem is discovered later, 
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dependencies between its source and the rest of the code that have grown over 
time can make remediation much harder. However, finding performance models 
is both hard and time consuming, which is why many developers shy away from 
it. Sometimes such models are simply built on inaccurate back-of-the-envelope 
calculations, rough estimates, simple and manual spreadsheet calculations, or even 
only developer intuition, which may be misleading. 

An analytical performance model expresses the performance of an application 
in terms of a purely analytical expression [20, 27, 31]. A target metric m (e.g., 
execution time, energy, or number of floating-point operations) is represented as 
a function m = f(x1,...,%X,) of one or more parameters x; (e.g., the number 
of cores or the size of the input problem). To make statements about application 
performance that can be relied on under changing conditions, it is usually not 
enough to focus on any single parameter in isolation. The effect that one varying 
parameter has on performance must be considered in the context of the variation of 
other relevant parameters, including algorithm options, tuning parameters such as 
tiling, or characteristics of the input data set. However, often it is not obvious which 
parameters are truly performance-relevant and should be included. In general, the 
decision whether to include a certain parameter or not has to trade off different 
criteria. Models with fewer parameters are easier to generate and maintain and 
provide more high-level insight, whereas models with more parameters can be 
more accurate because they consider more effects. Abstract application performance 
models with a reasonably small number of parameters can be designed and 
maintained by application developers while a system model can only be provided by 
system experts. Simpler models can be used as an interface to application developers 
and algorithm designers, while more complex models can be used for detailed tuning 
and projections. The task of modeling the performance of an application is rather 
complex and time-consuming though. This is why—in spite of its potential—it is 
rarely used in practice. However, with the help of automatic tools that support the 
creation of accurate performance models, this powerful methodology could spread 
across a much wider audience of HPC application developers. 

This article summarizes the results of the Catwalk and ExtraPeak projects, 
which set out to improve this situation. The main goal of Catwalk, the first of 
the two projects, was to make performance modeling more attractive for a broader 
audience of HPC application developers by providing a method to create insightful 
performance models as automatically as possible, in a simple and intuitive way. 
Given the success of Catwalk, the follow-up project ExtraPeak aimed to improve 
upon the basic performance modeling method by allowing the models to include 
more than one parameter, while preserving the speed and accuracy of the original 
model generation process. Since ExtraPeak represents the more advanced state of 
our research and the outcome of Catwalk is already summarized elsewhere [47], 
we take the liberty of expanding mostly on the achievements of ExtraPeak. 
The two projects are part of a wider pioneering effort to construct performance 
models automatically in multiple application areas also beyond HPC, ranging from 
enterprise systems [7] to databases [15] and software product lines [40]. 
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2 Overview of Contributions 


Although we focus on the contributions of ExtraPeak, we start by offering a short 
overview of Catwalk and its achievements, given that the former builds upon the 
results of the latter. Figure 1 summarizes the accomplishments of both projects and 
their relationship to each other. Results from Catwalk are shown in hatched boxes 
while results from ExtraPeak are shown in solid boxes. 


2.1 Catwalk 


The most important goal of Catwalk was to provide an automated method for 
constructing performance models. The focus was on the discovery of scalability 
bugs, achieved by creating models describing the performance of different parts 
of a program when scaled to larger processor configurations and demonstrated 
using several realistic applications, including Sweep3D, MILC, and HOMME [10]. 
The method is the foundation of the performance modeling tool Extra-P, a major 
outcome of Catwalk, which has been released under an open-source license. Extra- 
P enabled numerous application case studies to showcase the type of insights 
this analysis can provide, including UG4 [44], an unstructured-grid package, as 


HOMME, JuSPIC, MAFIA, 
Mantevo, MILC, MPI, MP2C, 
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Fig. 1 Main contributions overview. Contributions in boxes with darker shades and solid fill 
represent work completed in the ExtraPeak project, whereas contributions in boxes with lighter 
shade and hatched fill represent contribution from the Catwalk project 
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well as several state-of-the-art MPI [37] and OpenMP implementations [24]. In 
Catwalk, we also worked on approaches leveraging the performance-modeling 
base method to allow new research avenues to be explored. The results include a 
scalability test framework that combines performance modeling with expectations 
to systematically validate the scalability of libraries [37], a tool that automatically 
instruments applications to generate performance models at runtime [4], and a fully 
static method to derive numbers of loop iterations from loops with affine loop 
guards and transfer functions that can be used to limit the performance model search 
space [22]. 


2.2 ExtraPeak 


In ExtraPeak, we improved the basic modeling approach and expanded it to allow 
the analysis of multiple parameters simultaneously, automating the search space 
generation for models and even handling discontinuities in the model space. A 
continuous goal in the development of our methods was to ensure flexibility and 
ease of use. Our main accomplishments can be divided into three categories: new 
features for and improvements of the base method, advanced methods building upon 
it to explore new ways of expressing and understanding performance, and finally 
case studies leveraging Extra-P to gain insights into the performance of specific 
applications. 


Base Methods The extensions of the base method cover both multiple model 
parameters and a refined model search for a single parameter: 


e We extended the basic approach to allow insightful modeling of any combination 
of application execution parameters while using heuristics to decrease the time to 
find the best models quickly without compromising their quality [11]. 

e We developed an algorithm to detect segmentation in a sequence of performance 
measurements and estimate the point where the behavior changes, allowing 
complex irregular behaviors to be modeled [23]. 

e We designed a new model-generation algorithm by which the search space is built 
and automatically refined on demand relieving the user from the burden of search 
space selection [34]. 


Advanced Methods The flexibility of Extra-P together with the width of models 
it can express and the insight they offer make it easy to tailor it to advanced tasks 
related to the exploration of multi-dimensional performance spaces: 


e Task-based programming offers an elegant way to express units of computation 
and the dependencies among them, making it easier to distribute the computa- 
tional load evenly across multiple cores. We introduced an automated empirical 
method for finding the isoefficiency function of a task-based program, binding 
efficiency, core count, and the input size in one analytical expression. The insights 
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gained via these expressions can be used to co-design programs and shared 
system resources [38]. 

¢ In the co-design process, a fundamental aspect of the application requirements are 
the rates at which the demands for different resources grow as a code is scaled 
to a larger machine. We showed how automated performance modeling can be 
used to quickly predict application requirements for varying scales and problem 
sizes [12]. 


Case Studies In ExtraPeak, we continued the series of case studies started in 
Catwalk, in which we confirm prior expectations or discover the existence of 
previously unknown scalability bottlenecks—often in collaboration with other 
research teams. Many of them were conducted when we extended our base methods 
or developed advanced methods and are described in the work cited above. In 
addition, we conducted further case studies with the sole aim of better understanding 
the performance of their application targets: 


e We helped validate the complexity of Relearn, a code that simulates structural 
plasticity in the brain. Inspired by hierarchical methods for solving n-body 
problems in particle physics, Relearn uses a scalable approximation algorithm 
with the complexity O(n - logn), which can simulate the structural plasticity of 
up to 10° neurons—four orders of magnitude more than the naive O(n’) version 
previously available [35]. 

e We investigated two implementations of the LLL lattice basis reduction algorithm 
in the popular NTL and fplll libraries, which helps to assess the security of 
lattice-based cryptographic schemes, to validate their complexity in practical 
usage scenarios. This task reverses the perspective of classic HPC applications. 
In the field of cryptography, high algorithmic complexity is a desirable trait that 
characterizes the hardness of breaking certain security protocols [9]. 

e We performed an analysis of parallel sorting algorithms and further MPI imple- 
mentations using the framework for continuous scalability validation previously 
developed during the Catwalk project [39]. 


In the remainder of the article, we further describe the extensions of the base 
method as well as the advanced methods building upon it. Since this article focuses 
on advances in automatic performance modeling, we cover only case studies 
conducted alongside these developments, but do not include more details on the 
remaining ones listed above. For those, we refer the reader to the cited literature. The 
overview of our technological contributions is followed by a summary of ongoing 
developments in our project, a review of related approaches, and eventually a brief 
conclusion. However, because it is fundamental to everything we did in ExtraPeak, 
we first provide a short introduction to Extra-P, the tool whose development was 
started in Catwalk. 
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3 Extra-P 


The key result of Catwalk has been a method to identify scalability bugs. A 
scalability bug is a part of the program whose scaling behavior is unintentionally 
poor, that is, much worse than expected. As computing hardware moves towards 
exascale, developers need early feedback on the scalability of their software design 
so that they can adapt it to the requirements of larger problem and machine sizes. 
In addition to searching for performance bugs, the models our tool produces also 
support projections that can be helpful when applying for the compute time needed 
to solve the next larger class of problems. For a detailed description, the reader may 
refer to Calotoiu et al. [10]. 

The usual input of our tool when used as a scalability bug detector is a set 
of performance measurements on different processor counts {p1,..., Pmax} in 
the form of parallel profiles. As a rule of thumb, we use five or six different 
configurations. The output of our tool is a list of program regions, ranked by their 
predicted execution time at a chosen target scale or by their asymptotic execution 
time. We call these regions kernels because they define the code granularity at which 
we generate our models. 


Model Generation When generating performance models, we exploit the obser- 
vation that they are usually composed of a finite number n of predefined terms, 
involving powers and logarithms of p: 


f(p) = docx: pik - logs (p). (1) 
k=1 


This representation is, of course, not exhaustive, but works in most practical 
scenarios since it is a consequence of how most computer algorithms are designed. 
We call it the performance model normal form (PMNF). Moreover, our experience 
suggests that neither the sets J, J C Q from which the exponents ix and jx are 
chosen nor the number of terms n have to be arbitrarily large or random to achieve 
a good fit. Thus, instead of deriving the models through reasoning, we only need to 
make reasonable choices for n, J, and J and then simply try all assignment options 
one by one. For example, a default we often use is n = 3, J = {$. 5, 5, 3, $, 3, $}, 
and J = {0, 1,2}. A possible assignment of all i, and jg in a PMNF expression is 
called a model hypothesis. Trying all hypotheses one by one means that for each 
of them we find coefficients cg with optimal fit. Then we apply cross-validation to 
select the hypothesis with the best fit across all candidates. As an alternative to the 
number of processes p, our method can also support other model parameters such 
as the size of the input problem or other algorithmic parameters—as long as we vary 


only one parameter at a time. 
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Our tool models only behaviors found in the training data. We provide direct 
feedback information regarding the number of runs required to ensure statistical 
significance of the modeling process itself, but there is no automatic way of 
determining at what scale particular behaviors start manifesting themselves. This 
version of our method is most effective for regular problems with repetitive 
behavior, whereas irregular problems with strong and potentially non-deterministic 
dynamic effects require enhancements which we detail in Sect. 4.2. 


Open Source Release The Extra-P performance-modeling tool has been made 
available online under an open-source license.! Users have access not only to the 
software but also to documentation material describing both our method and its 
implementation. We have been actively supporting the use of Extra-P at several 
organizations, among them the High Performance Computing Center Stuttgart, TU 
Darmstadt, Lawrence Livermore National Laboratory, FZ Jiilich, and the University 
of Washington, just to name a few. 

Figure 2 shows how the results of the model generator can be interactively 
explored. The GUI annotates each call path with a performance model. The formula 
represents a previously selected metric as a function of the number of processes, 
and allows other parameters to be represented as well. The user can select one or 
more call paths and plot their models on the right. In this way, the user can visually 
compare the scalability of different application kernels. 

The profiles needed as input for the model generator are created in a series of 
performance experiments. To relieve the user from the burden of manually submit- 
ting large numbers of jobs and collating their results, we use the Jülich Benchmark 
Environment (JUBE) [26], a workflow manager developed at Forschungszentrum 
Jiilich. 

The tool was presented at multiple tutorials at conferences such as EuroMPI and 
Supercomputing, as well as at numerous VI-HPS and HKHLR tuning workshops. 
Following a 90-min theoretical explanation of the method and the tool, users 
were able to model the performance of two example applications, SWEEP3D and 
BLAST, in a 90-min practical session. Using previously prepared measurement 
data, they were able to generate models for the entire codes, evaluate the results, 
and understand the scaling behavior of the two applications. With this knowledge, 
attendees are able to apply Extra-P to their own applications, once the required 
performance measurements have been gathered. Because Extra-P is compatible 
with Score-P an established infrastructure for performance profiling, even collecting 
these measurements is straightforward. 


'http://www.scalasca.org/software/extra-p/download.html. 
*http://www.scalasca.org/software/extra-p/documentation.html. 


461 


uonenuits 
yodsuen uonnou e ‘qEJJAMS Ul sued Jeo 10} payeisuas s[apoul souLUTIOJIEd SMOYS JOYS UAIS SY], Spouw sdueULIOJIed Jo uoneIojdxə IAYWIUJ 7 “SIL 


oseses 000'0 
3 4 : r 4 
3 0000r weus > sex 
Bpow mous a 
= tort ECO OUXELEE TFIO FS OTNER'T Wns WF pegao = 
s eset eco sazo (drbonsoro+sze's Pues idw 
mor somre mome? soner ovut SOUT erus ° wor L680  -OWSEEP “Hore OTKOPT'E+ZEO'O peor pes = 


ExtraPeak: Advanced Automatic Performance Modeling for HPC Applications 


y ydes6 oun 


wE 
oc-OLNESOT 


lave - O12 966 74060 0 
8187450000 


ee DIXL0ZZ LSESS'S 

rosé BITO we OTKEFS'T “TIZH Ete- OTKORS'D mueg idn 

wer PSO  mOMIIST TELO Edw OTKESE'E DuA eq = 

aroo 0o00  m-otxrco't so-eseoet'9 woobyn 

tso 6890  m-OIKIETG ~THL9L'6 +O TEST mao 

6st CPFO  m-OWBEOT UNTLG S Foe OTXBETD Fuu a 

OPEL covo  sOmoert PLSEET'O squ 
w OOIT 0000 OTKE60'6 srousT 0 onena « 

1900 0000  otxses9 teszero soy = 
“vo OOO  m-OIXBZST “TNPBE Zs- OINZOT'S One auu, = 
eri ITO OUR T “POSLI OTKLZSD dwo29p 
zw SLUG OTX9OSL “THEBE Sve OTKEZE’T wean 

wert 0000 = -OTxorE'E SO-Ortgst't BANOA e 

erst GEO — me OIMOSEL TAEST'tree-O1HONT'Z seg lan | 
uyo GLOO ve OIMLEL ZS INLIT Ttm- OMLZE'L wr yng = 
caver SELO OXI RS “TPLO Ts ORE T wd peas = 
1959 SOO OTKESO'T UXTEP'Z +e OTX6L0T aud an 

wert TSHO ep OUXETL'T INGI Ets- OTXLOO'T WASIR = 

ure COO — ee OLKELO'T “Hove O1KEES'Z-STO'O wed ian 

ure ZELO pt -OLKZED'D —1*P9O'T+u-OTMPSE'Z wr yseaq = 
090 tero TIKOLE Tto- OIII? aue wuwo ian 

4990 6870 -OTISO T Oz wwo dW 

cree osso T1200 wore 
seee uso TXSPP'S re O1N98L'9 worse) a 
619 Ot LEO OTIS? “Jon-01x606 140100 soup = 
3avns Ai sou Onea ouno) pedea nos, 


462 A. Calotoiu et al. 
4 Developments of the Base Methods 


While the tool created during the Catwalk project is already widely used in the HPC 
community and provides meaningful insights to developers seeking to understand 
the performance of their codes, we have added significant new features during the 
ExtraPeak project which we detail below, most important among them the ability to 
model multiple parameters simultaneously. 


4.1 Multi-Parameter Modeling 


Common questions asked by developers when trying to understand the behavior of 
applications are: 


e How does application performance change when more processors are used? 
e How does application performance change when the problem size is increased or 
decreased? 


When considering the pressure on applications to judiciously use computing 
resources both questions must be answered, and a new vital question arises: 


e Are the effects of processor variation and problem-size variation independent of 
each other or can they amplify each other? 


For example, a weak-scaling run of the kernel SweepSolver in Kripke [28], a 
particle transport proxy application, has a runtime model for processor variation 
of t(p) = O( p" 3) and a runtime model for varying the number of dimensions of 
t(d) = O (d). The number of dimensions influences the problem size proportionally. 
It now needs to be determined how these two factors play together. Depending on 
their interaction, the application is scalable or not. For example, it would make a 
huge difference whether the combined effect of processor variation and number of 
dimensions was t (p, d) = O(p!/3 - d) or t(p, d) = O(p!7 +d). 

We expanded the original performance model normal form presented in Sect. 3 
to include multiple parameters. 


n m 
FO Xm) = Doce | [xy log Gx) (2) 
l=1 


k=1 


This expanded normal form allows a number m of parameters to be combined in 
each of the n terms that are summed up to form the model. Each term allows each 
parameter x; to be represented through a combination of monomials and logarithms. 
The sets 7, J C Q from which the exponents ix, and j, respectively, are chosen 
can be defined as in the one-parameter case. 

Of course, if multiple parameters are considered, performance experiments have 
to be conducted for all combinations of parameter values and the total number of 
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experiments that is required grows accordingly. While this might be manageable 
if the number of parameters considered is small enough and/or the cost of an 
individual experiment is very small, another and more serious problem emerges 
even for two and three parameters, namely the combinatorial explosion of the model 
search space. 

Therefore, multi-parameter modeling was outside the reach of automatic meth- 
ods due to the exponential growth of the model search space. We developed a new 
technique to traverse the search space rapidly and generate insightful performance 
models that enable a wide range of uses from performance predictions for balanced 
machine design to performance tuning. The details can be found in the work of 
Calotoiu et al. [11], but we present the most important heuristics here, together with 
a summary of the evaluation. 


Hierarchical Search The idea is to first obtain single parameter models for each 
individual parameter. Once we have these models, all that is left is to compare 
all additive and multiplicative options of combining said models into one multi- 
parameter model and to choose the one with the best fit. 

The size of the search space for this approach is as follows, given m parameters 
and one n-term model for each of them. We must combine all subsets of terms 
of each single-parameter model with each subset of terms of each other single 
parameter model. The number of subsets of a set of n elements is 2”, so the total 
size of the search space is 2”. 

Assuming there are three parameters, the single-parameter models for all of them 
have been computed and each model has three terms (the worst case scenario for 
search space cardinality in this case), the number of hypotheses that have to be 
tested is 2**? = 512. Adding the 3 times 25 steps needed to generate the single- 
parameter models, we will need to look at most at 587 models to find the best fit, 
compared to the 6.51 - 1014 in the unoptimized approach. 


Evaluation To evaluate the multi-parameter modeling approach we quantify the 
speedup of the model search in comparison to an exhaustive traversal of the same 
search space. Furthermore, we determine the frequency at which our heuristics lead 
to models that differ from the ones the exhaustive search produces. In those cases 
where the models we discover are different, we analyze these differences and discuss 
their impact on the quality of the results. Because traversing the entire search space 
for three or more parameters is prohibitively time consuming even with a very small 
number of potential terms, we allow only at most two model parameters for the 
purpose of this comparison. 

The evaluation is divided into two parts. First, we examine how closely the 
models generated both through exhaustive search and with the help of heuristics 
resemble inputs derived from synthetically generated functions. This allows our 
results to be compared with a known optimal model. Second, we compare the results 
of both approaches, when applied to actual performance measurements of scientific 
codes, which factors in the effects of run-to-run variation. 
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Synthetic Data We generated 100,000 test functions by instantiating our normal 
form from Eq. | with random coefficients. Our model generator responded in three 
different ways: 


1. Optimal models. The most common result (ca. 95%) is that the heuristically 
determined model, the model determined through an exhaustive search, and the 
known optimal model are identical. 

2. Lead-order term and its coefficient identified, smaller term not modeled by either 
method. In rare cases, neither modeling approach is capable of detecting the 
smaller term and they both only model the lead-order term. The effect on the 
quality of the resulting models is very small, and an attempt to model such small 
influences will often lead to noise being modeled instead. 

3. Lead-order term and its coefficient identified, smaller additive term only modeled 
by exhaustive search. In this case the heuristic approach fails to identify the 
parameter with a much smaller effect. The effect on the quality of the resulting 
model is again negligible. 


Table | displays the number of times the modeling identified the entire function 
correctly and the times only the lead-order term was identified correctly. The lead- 
order term was correctly identified in all test cases. The difference in time required to 
obtain the 100,000 models is significant: 1.5h when using the heuristics compared 
to 107h when trying out all models. 


Application Measurements In addition to synthetic data, we evaluated our heuris- 
tics with three scientific applications: Kripke, Cloverleaf, and BLAST. For BLAST 
we used two qualitatively different solvers and will therefore present separate 
results. Real data sets come with new challenges, such as not knowing the optimal 
model, and indeed no guarantees that the assumptions required for our method 
hold, namely that the optimal model is described by one and only one function 
and that the function is part of the search space. Figure 3 shows the results of both 
applying the heuristics and searching the entire solution space. As expected, in the 
overwhelming majority of cases the two approaches provide the same result (84%), 
or at least present the same lead-order term (14%). In about 2% of the cases the 
models differ. The reason is that noise and outliers occurring in real data sets are 
not limited to any arbitrary threshold compared to the effect of different parameters 
on performance. The projection used by the heuristics to generate single-parameter 


Table 1 Evaluation of heuristics using synthetic functions 


Search type | Heuristic | Exhaustive 
Optimal models identified 95,480 [95.5%] |96,120 [96.1%] 
Lead-order term identified 4,520 [4.5%] 3880 [3.9%] 
(including coefficient) 

Lead-order term not identified 0 [0%] 0 [0%] 


Modeling time | 1.5 hrs. | 107 hrs. 
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Fig. 3 Comparison of performance models obtained for all kernels of scientific applications using 
either our heuristics or a full traversal of the search space. For each application, we show the 
percentage of times where the resulting models were identical (left bar), where only the lead-order 
terms and their coefficients were the same (center bar), and where the lead-order terms were also 
different (right bar) 


models out of multi-dimensional data reduces noisy behavior to a higher degree than 
the exhaustive search does. Therefore, in these rare cases, the heuristic approach 
results in models with a slower growth rate than the ones identified through an 
exhaustive search. The optimal model is not necessarily the one identified by the 
exhaustive search, as noise could be modeled alongside the parameter effects. 

In all three cases, the model generation for an entire application took only 
seconds and was at least a hundred times faster than the exhaustive search. 
Generating performance models for an entire application means one model per call 
path and target metric. The search space reduction in all three cases was five orders 
of magnitude (from 4,250,070 model hypotheses down to 66 per call path and target 
metric). 


Discussion The evaluation with synthetic and real data demonstrates that our 
heuristics can offer results substantially faster than an exhaustive search—without 
significant drawbacks in terms of result quality. For three or more parameters, the 
size of the search space would have prevented such a comparison altogether, which 
also means that the exhaustive search presents no viable alternative beyond two 
parameters. 


4.2 Segmented Modeling 


Although our method is very powerful it struggles with the situation where perfor- 
mance data representing two or more different phenomena need to be combined 
into a single performance model, effectively being a function composed of multiple 
segments. This not only generates an inaccurate model for the given data, but can 
also either fail to point out existing scalability issues or create the appearance of 
such issues when none are present. 
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We have developed an algorithm to help Extra-P detect segmentation in the data 
before models are generated. Its input is a set of performance measurements, while 
the output indicates whether the given measurements show segmented behavior or 
not. If the data turns out to be segmented, the algorithm tries to identify the change 
point. With this information, Extra-P can generate separate models for each segment 
and/or request new measurements if any segment is too small for model generation. 
For more information, we refer the reader to Ilyas et al. [23]. 

Our method correctly identified segmentation in more than 80% of 5.2 million 
synthetic tests and confirmed expected segmentation in three application case 
studies. The results of this evaluation show that the proposed algorithm can be 
used as an effective way to find segmentation in performance data when creating 
empirical performance models. The suggested algorithm does not require any extra 
effort on the user’s side, and can work very well on as few as six points. 


4.3 Iterative Search Space Refinement 


While Extra-P strives to be easy to use and provide meaningful insights even to 
users without an extensive background in performance modeling, the version we 
developed in the Catwalk project still required the manual pre-configuration of the 
search space. This adds a layer of complexity to the modeling process, and requires 
the user to have an idea of what type of models can be expected from his application. 
Should the search space not include a model found in the code, the method will 
still try to approximate by selecting the one best fitting the data from the set of 
options available, but the result will be less accurate. Furthermore, noise in the 
data often leads to models that indicate a worse behavior than there actually is as 
sometimes a model in the search space fits not only the behavior we are trying to 
capture but also the noise, leading to overfitting. We have developed a new iterative 
model-generation approach, where we configure the search space on demand and 
iteratively raise the accuracy of the model until no meaningful improvement can be 
made. In this way, we increase both the ease of use for Extra-P and its range of 
application without sacrificing accuracy. For details, we refer the reader to the work 
by Reisert et al. [34]. In the following we show a summary of the results, showing 
the improvements that the new approach provides. 

We used measurements from previous case studies to evaluate our new algorithm 
on measured data. The measurements include a variety of call paths (i.e., kernels) 
and different metrics, such as runtime, number of function calls, memory footprint, 
and network traffic. 

The results of the comparison, which are presented in Table 2, show that, when 
the last (i.e., largest) measured data point is excluded from the data used to calculate 
the model, the model produced by our new algorithm allows for a better prediction 
of the last point in 19-65% of the cases, which corresponds to 53-85% of those 
models that changed in each benchmark. Although some predictions do get worse, 
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Table 2 Comparison of the original and our improved algorithm, using data from previous case 
studies, showing the quality of predictions of the last data point when that point is not used for 
modeling 


Benchmark Number of Model Model predictions (percent- Mean relative predic- 
points count age of all models) tion error [%] 

better same worse before now 
Sweep3D [11] 7 96 26.04 56.25 17.71 17.26 6.31 
HOMME [11] 9 670 18.81 68.51 12.69 3.69 3.03 
MILC [11] 9 1496 30.95 56.48 12.57 36.71 14.53 
UG4 [44] 5 2026 52.62 38.01 9.38 68.30 15.58 
MPI collect. [38] 7-8 26 65.38 7.69 26,92. 52.53 15.89 
BLAST [9] 3 103 31.07 41.75 27.18 34.92 10.38 
Kripke [9] 5 36 36.11 38.89 25.00 33.05 8.32 


Total 5-9 4453 39.12 49.11 11.77 45.71 12.97 


the mean relative prediction error decreases across all applications, in all but one 
case even significantly. 

Not shown in the table is the number of models that are constant, which has 
considerably increased in every single case study (from 44 to 76% overall). Because 
the synthetic evaluation has shown that our new algorithm is able to recognize 
constant functions more reliably, this indicates that the previous algorithm might 
have modeled noise or tried to fit a PMNF function to inaccurate measurements. 

With iterative refinement we remove the need for a predefined search space and 
also significantly reduce the number of false positives by being more resilient to 
noisy measurements of constant behavior. Most of the models generated with this 
new algorithm are able to make predictions that are equally or even more accurate 
than before. We therefore open the way for a performance modeling workflow that 
is more automated than ever and equips developers with a tool that helps them 
efficiently understand the performance of their applications. 


5 Developments of the Advanced Methods 


Extra-P is at its core a tool to generate a human-readable function out of a set 
of inputs. The flexibility it offers and the features we added over time made it 
attractive for uses beyond identifying performance bugs in parallel applications. 
Some approaches, such as the compilation and modeling framework by Bhat- 
tacharyya et al. [4], were developed during the Catwalk project. The scalability 
framework [37] was also developed during the Catwalk project, but we have further 
refined and expanded it during the ExtraPeak project, and also tested a number 
of parallel sorting algorithms using this framework [39]. Two research directions 
newly investigated during the ExtraPeak project are requirements engineering and 
iso-efficiency modeling, and are discussed below. 
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5.1 Lightweight Requirements Engineering 


Co-designing applications and the system is a powerful technique to ensure early 
and sustained productivity as well as good system design, especially in high- 
performance computing where the cost of systems is very high and applications 
are expected to remain in use for long periods of time. In their early phases, such 
co-designs often rest on back-of-the-envelope (BOE) calculations. In general, such 
calculations allow problems in applications to be detected early on and their severity 
to be determined years before the machine is installed or the first prototype becomes 
available. This is increasingly important since mitigating such problems can often 
take several personyears. On the system side, BOE calculations allow designers 
to adjust system parameters to target applications, for example, they can be used 
to determine the required bytes-to-flop ratio of memory, network, or even the file 
system. In addition, they can be used to determine required memory sizes, usability 
of accelerators and co-processors, and even the number of sockets and size of 
shared-memory domains in the target system. 

We automate these BOE calculations in a lightweight requirements analy- 
sis for scalable parallel applications. We introduce a minimal set of hardware- 
independent application-centric requirements that cover the most significant aspects 
of application behavior. Combining performance profiling [1, 8] and stack-distance 
sampling [3] with a lightweight automatic performance-modeling method [10, 11], 
we generate empirical models of these requirements that allow projections for 
different numbers of processes and problem sizes. 

As the foundation of our approach, we define a very simple notion of require- 
ments that supports their quantification in terms of the amount of data to be stored, 
processed, or transferred by an application. Knowing these numbers alone does not 
target a precise prediction of application runtime but can serve as an indicator of 
the relative importance of certain system resources and how this ratio changes as we 
scale a program to a larger system. Ultimately, our requirements are expressed in the 
form of empirical models that allow projections for different numbers of processes 
and problem sizes. 


Application-Centric Requirements We choose requirements to be purely appli- 
cation centric, that is, we do not make any assumption about the hardware other 
than the ability to run the code as is. Hence, all our requirement metrics refer to 
data flow at the interface between hard- and software—not between lower layers of 
the hardware. While specific hardware features could improve the rate at which the 
requirements are fulfilled, the classes of behavior our requirement models capture 
will not change. For example, even if revolutionary hardware features double 
the speed at which floating-point computations are performed, if the number of 
floating-point computations that need to be performed grows quadratically with the 
number or processes, while all other requirements remain constant, the floating- 
point requirement will remain the bottleneck for that particular application as it 
scales up. 
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Table 3 Requirement metrics 


Resource |Metric 
Memory footprint # Bytes used (resident memory size) 
Computation # Floating-point operations (#FLOP) 


Network communication |# Bytes sent / received 
Memory access # Loads / stores; stack distance 


Since it is currently the predominant programming model and also expected to be 
highly influential in the future, we stipulate that of each target application an MPI 
version exists. Application requirements are then expressed as a set of functions 
r(p, n) that predict the demand for resource r depending on the number of processes 
p and the problem size per process n. Because we regard thread-level concurrency 
merely as a way to satisfy the requirements, we consider requirements not below the 
granularity of processes, which may nevertheless be multithreaded—either locally 
or by launching GPU kernels. 

Currently, we consider the requirement metrics listed in Table 3, classified 
by the resource they refer to. I/O would be handled analogously to the network 
communication requirement. None of our analyzed applications includes significant 
I/O traffic, we therefore refrain from including I/O metrics in this analysis. Our 
metrics characterize application requirements in terms of space (i.e., memory 
consumption) and “data metabolism” (i.e., bytes processed in floating-point units or 
exchanged via memory and network). Because the amount of data moved between 
processor and memory subsystem alone is barely a reliable indicator of the pressure 
an application exerts on the memory subsystem, we also consider memory access 
locality. 


Co-design The key point of our method is to guide the programmer to find 
application bottlenecks relative to an architecture as well as to guide the architect to 
find system bottlenecks that a given application would experience. Our requirements 
models are functions of the number of MPI processes p and the input problem size 
n. To compare the requirements of an application on two different architectures, 
all we need to do is to calculate the application requirements using the values for 
p and n the application would use on these two systems. For details regarding 
the collection of requirement metrics and more detailed case-studies we refer the 
readers to Calotoiu et al. [12]. In the following we wish to present a brief example 
for the types of insights requirements modeling offers. 

LULESH is a widely studied proxy application in DOE co-design efforts 
for exascale which calculates simplified 3D Lagrangian hydrodynamics on an 
unstructured mesh. The problem size per process is defined as the simulated volume 
per process. The growth rates of all requirements with respect to both problem 
size and process count are very close to ideal. With the current implementation, 
the multiplicative effect process count and problem size per process have on 
computation and communication for LULESH is a small obstacle in tailoring and 
scaling the application to run on different systems. The growth rates are slow enough 
to limit these issues at anything except the most extreme scales. Having introduced 
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Table 4 Per-process requirements models 


Metric Model 
#Bytes used 10° -nlogn 
= #FLOP 10° -nlogn- p®*> logp A 
4 #Bytes sent & received 10° -n- p> logp A 
z #Loads & stores 10° - n logn - log p 
= Stack distance Constant 


the per process requirement models, we can now showcase the workflow to evaluate 
a possible system upgrade taking account of these requirements. Let us consider the 
scenario where LULESH is working on a given system, but needs to be deployed to 
a larger system of the same type, for example one having twice the number or racks. 

The requirements of LULESH are listed at the top of the table as part of Step I. 
Following this process, we can now draw conclusions regarding system utilization, 
requirements balance, and usefulness of a particular upgrade. The ratios between 
new and old problem sizes indicate how the largest problem size that can be solved 
changes, both per process and overall. The ratios between new and old requirements 
indicate which system components will experience an increased load relative to 
other components (Table 4). 

The requirements of LULESH can be expressed as the product of single- 
parameter functions that either depend the problem size per process or the number 
of processes. When doubling the racks, only the value of p changes, and in 
this particular case, all terms depending on n can be reduced when determining 
the ratios of the changing requirements. This means that these ratios are valid 
regardless of the problem size per process. This will not generally be true as it 
depends on the specific relative upgrade. That the number of processes affects 
computation and communication means that these requirements increase slightly. 
Luckily, computation and communication only increase by 20% and will therefore 
allow LULESH to solve an overall problem twice as large with only a small 
performance degradation. 


Discussion The workflow we propose leverages these models to enable system 
designers and application developers to ponder various upgrade and design options. 
We characterize performance in terms of relative requirement changes—from one 
system or one application to another. This pattern indeed matches the common case, 
where an initial version of an application running on an initial system already exists. 
And even if no such system exists, our approach can successfully help compare 
design options. The main advantage of our approach in relation to architecture- 
specific performance models, which are traditionally hard and laborious to produce 
with high accuracy, however, is the small effort on the one hand and the low 
complexity of the models on the other, facilitating quick insights at low cost—easily 
at the scale of an entire compute-center workload (Table 5). 
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Table 5 Workflow for determining the requirements of LULESH after doubling the number of 
racks 


I: Create requirement models for memory footprint, communication, computation, and memory 
access. 


Metric Process scaling and problem scaling 
#FLOP nlogn - p™” log p 
#Bytes sent & recv. n- p> log p 
#Loads & stores nlogn - log p 
#Bytes used nlogn 


II: Determine the new maximum number of processes and new memory available per process 
that the upgraded system supports. 


Configuration parameter Old New 
Processes count p p'=2p 
Memory m mw = 


IMI: Determine the new memory footprint requirement per process if all processors are used. 


Metric Old New 


#Bytes used nlogn n’ logn’ 


IV: Determine the new problem size per process such that the memory footprint equals the 
memory available to each process and compute the new overall problem size. 


Metric Old New Ratio 
Problem size per proc. nlogn=n'logn’=m 1 
Overall problem size pn pn 2 


V: Determine the new requirements for computation, communication, and memory access. 


Metric Old New Ratio 
#FLOP p’>logp (2p)%” log 2p 21.2 
#Bytes sent & recv. p’>logp = (2p) log 2p 21.2 


#Loads & stores log p log 2p z 
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5.2 Configuring and Understanding the Performance of 
Task-Based Applications 


Task-based programming models, such as Cilk [6] or OpenMP [32], are well known 
and as the number of cores per node continues to increase, they gain more and 
more attention. One major advantage of task-based programming is that it allows 
parallelism to be expressed in terms of tasks, which are units of computation that 
can be either independent, dependent on a previous task, or a prerequisite to a 
subsequent task. Explicitly expressing parts of the code as tasks allows the compiler 
to take care of all the thread management intricacies, thereby sparing the user from 
tedious low-level details. 

However, ensuring that that the problem to be solved is large enough to require 
a certain number of tasks is a difficult problem, and requires extensive analysis. 
The efficiency of the program will decrease as more processing elements are added. 
The only way to ensure that efficiency remains constant, as the number of cores 
increases, is to increase the input size as well. This concept is embodied in the iso- 
efficiency relation [17], which binds the number of processing elements (PEs) the 
application uses to the input size. It specifies by which factor the input size has to 
increase, with respect to the increase in the number of PEs, to maintain constant 
efficiency. Isoefficiency can be generalized to a two-parameter efficiency function 
that provides efficiency values as a function of both the PE count and the input size. 

Although isoefficiency analysis is useful in understanding the scalability behav- 
ior of algorithms, it is not straightforward to apply and requires deep knowledge 
of the algorithm. In practice, however, task-based algorithms experience hardware 
limitations in the form of resource contention in general and memory contention 
in particular. Resources such as cache and memory controllers are limited and 
can negatively impact application scalability [46]. These might render theoretical 
isoefficiency functions not accurate enough to be used in practice. To be able to 
make informed decisions as to how big the input size should be in order to use all of 
the allocated cores efficiently, the user not only has to have a realistic isoefficiency 
model but also needs to understand the severity of resource contention at higher 
scales. 

We proposed a novel practical method to automatically model the empirical effi- 
ciency functions of task-based applications [38]. Modeling the efficiency function 
allows us to easily derive an isoefficiency relation for any realistic target efficiency, 
and a carefully designed framework allows replays with different contention 
assumptions. 

In our approach we identified three different efficiency functions for a task-based 
application: 


1. Eac(p, n): The actual efficiency function of the application, modeled after the 
empirical results of runtime benchmarks. In this case the application runs as it is 
and experiences contention. Therefore, this function reflects realistic application 
performance including resource contention and scheduling overhead. 
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2. Ecf(p, n): The contention-free efficiency function, modeled after the results of 
replaying empty task skeletons according to the application’s task dependency 
graphs. The replay uses the same task dependency graphs and scheduling policy 
as in the original runs that were benchmarked to produce Eg-(p,n). Since 
the replay is free of resource contention, this efficiency function reflects an 
ideal situation in which the application does not experience resource contention 
caused by threads accessing the same resource simultaneously. 

3. Eyp(p,n): An upper bound on the efficiency of the application. Since efficiency 

is defined as Sew, an upper bound on the speedup also limits the efficiency. 

Considering the average parallelism 7 (n) for a problem size n, we determined 

that Sp(n) < min{p, z(n)}, thus we define E,)(p,n) = min{1, zw), This 

function describes an ideal situation of maximum speedup that is hardly 
achievable in practice. 


Beyond simply uncovering fundamental scalability limitations in an algorithm, 
we can provide insights into the impact of resource contention and determine what 
input size is required for any given degree of parallelism to reach a given efficiency. 
Further questions that our approach can answer are: What is the required core count 
for a given input size such that we maintain a constant, given efficiency? Which 
efficiency can we expect for a given number of cores and input size? Both questions 
are related to the co-design process when hardware designers have to understand 
how to make future systems suitable for both existing and future applications. 
Details regarding the approach and the framework required can be found in the 
work by Shudler et al. [38]. 

For an idea of the type of insights we provide, we show the summary of results 
for a number of task-based benchmark applications. Table 6 presents the efficiency 
models of the evaluated applications. There are 3 rows for each application 
listing the three efficiency models that we created (i.e., Eup(p, n), Eac(p, n), and 
Eef (p, n)). In all the models the logarithms are binary. The rRMSE column is the 
relative root-mean-square error. It is a standard statistical factor that measures the 
relative differences between the observed data and the model, and is defined as: 


tRMSE = o/y, where: o = ,/ X; (f (xi) — yi)? /n, yi are observed data, and y is 


the mean of the y; values. The last column shows the input size n, derived from our 
models by letting the efficiency E be 0.8 and the core count p be 60. 

All of the Eac(p, n) and Ect (p, n) models follow the same pattern C— A- f (p)+ 
B- f (p)g(n) that empirically emerged from our measurements. The interpretation of 
this pattern is that the first term, the constant C, is approximately 1 and it denotes the 
maximum attainable efficiency. The second term, — A - f (p), reflects the reduction in 
efficiency that occurs when we increase the core count. The last term, B - f (p)e (n), 
denotes the efficiency that we gain when we increase the input size. Together these 
terms reflect the interplay between the core count and the input size, and the effect it 
has on the efficiency. In the case of FFT, the constant B in the last term of Eac(p, n) 
is very small, which means that resource contention is a very significant factor 
and even large increases of the input size are not enough to offset the drop in the 
efficiency. 
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Table 6 Efficiency models of the evaluated applications 


Application Model rRMSE Input size for p = 60 
Eac 1.09-0.51 - yp +3.11- 107? - yp logn 9.7% 37,718 X 37,718 
Cholesky Eef 1.14 -0.54 : yP + 3.4 : 10? ; yp logn 7.8% 24,685 x 24,685 
Eup min{1, (2.29 + 2.35 - 10° - n) - p`! } 2.4% 19,500 x 19,500 
Eac 0.96 — 0.1 - log p + 5.08 - 1072n45 log p 19.5% 30,310 x 30,310 
FET Eep 1.03 — 0.16 - p67 + 1.04 - 107? - p®-67 logn 4.8% 15,800 x 15,800 
Eup min{1, (1.19 - 107? - 9-67 logn) - p7!} 4.1% 5,800 x 5,800 


Eac 0.98 -5.11 -1073 - p!- +1.76- 1073 -pL logn 3.5% 51 
Fibonacci Eef 0.97— 1.46 - 10° - p!” +9.26 - 10° - pp! logn 3.0% 51 


Eyp min{1, (25.48 + 0.49 - n>” logn) - p7!} 15% 49 
Eac 1.04-0.66 - yp +0.17 - yp logn 13% 14 
NQueens Eef 1.0- 6.21 -10° ; p +1.61 - 107? - plogn 3% 13 
Eup min{1, (2.18 -n?-8 logn) - p~!} 6.6% 12 
Eac 0.99 -9.2 -10° - p!34+2.29-104-p!?logn 1.9% 350G 
Sort Ecy 1.0-4.61 - 107? - p°% +1.62 : 1073 - p% logn 5.7% 6.6M 
Eup min{1, (3.53 + 3.32 - 107? - yn) - p7! } 6.7%  1.7M 
Eac 1.02 — 0.46 - p% + 3.28 - 10°? - p% logn 6.3% 12,000 x 12,000 
SparseLU Eef 1.05 — 0.48 - p®-®7 + 3.49 - 10°? - p“ logn 6.1% 11,000 x 11,000 
Eup min{1, (5.8 -10 -n!-? logn) - p7!} 1.7% 7,800 x 7,800 
Eac 1.55 — 1.02 - p> +4.59 - 10°? - p?” logn 9.5% 83,600 x 83,600 
Strassen Eef 1.26 — 0.65 - p™33 +3.89 - 107? - p™33 logn 5.9% 12,680 x 12,680 
Eup min{1, (0.25 -n°-7) - p7! } 2.4% 1,200 x 1,200 


By analyzing the discrepancies between these efficiency functions, we are able to 
provide answers to questions regarding co-design aspects, the connection between 
poor scaling and resource contention, optimization potential, and the presence of 
scalability bugs. 


Discussion Our approach is viable for analyzing both the effects of resource 
contention on efficiency and further optimization potential. It provides users with 
an insight into whether the obstacle to scaling is resource contention or insufficient 
parallelism in the structure of the task dependency graph. In addition, users can also 
calculate the required input sizes to keep efficiency constant on a given core count. 
This approach can be used in co-design analysis to understand how many processing 
elements to put in a future machine, such that we can have high efficiency with 
realistic application input. 


6 Ongoing Work 


While Extra-P is already a powerful and versatile tool, we believe there are still some 
areas where we can improve and streamline our approach even more. Our efforts 
are currently focused on methods that would make it easier to consider multiple 
parameters in the performance modeling process. In the following, we briefly 
discuss two promising and so far unpublished approaches, one aiming to reduce 
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the number of inputs the modeling algorithm requires to generate accurate multi- 
parameter models, and another targeting the use of compile-time information to 
identify relevant parameters and formulate expectations regarding their interaction, 
further reducing the need for measurements while improving the quality of the 
resulting models. 


6.1 Reducing the Cost of Measurements with Sparse Modeling 


We have shown how useful the performance models we generate can be to 
developers, but even though the modeling is done cheaply and automatically, we still 
require a series of small-scale experiments in order to start the process. Therefore, 
the experiment design determines the quality of the model as well as the overall cost 
of the modeling process. The current state of the art requires at least five different 
values for each parameter, and measurements with all possible combinations of 
values for all parameters considered. Therefore, an exponential number of samples 
is needed, namely 5” if n parameters are being modeled. For specific applications 
this makes it impractical to even create performance models. We are working 
on a novel parameter sampling approach that utilizes reinforcement learning, and 
leverages a sparse modeling technique, which only needs a polynomial number of 
samples and allows a more flexible experiment design. 

We have made the observation that Extra-P assumes that there is one and only 
one behavior with respect to each parameter across the entire measured space. If 
this is true, the same function terms describing the effect of a given parameter 
should be identified no matter which sequence of five measurements is considered 
as long as the effect of all other parameters are kept constant. Rather than requiring 
all combinations of all values for each parameter, it could be sufficient to select 
a sequence of five measurements for each parameter to create single-parameter 
models, but a thorough analysis is required to ensure that lowering the number 
and cost of measurements is not detrimental to the quality of the results. When 
considering the interaction of parameters a new challenge arises: the binary decision 
of whether effect of any parameter pair is additive or multiplicative cannot be 
made with only a sequence of five measurements for each parameter. At least one 
additional data point is required, one that is not part of those sequences. In our 
evaluation, the addition of this one additional point improves the number of correctly 
identified models from 81.1 to 99.9%, while more data points only marginally 
improved the results. 

Figure 4 shows a set of measurements that is usually sufficient to correctly 
identify two-parameter performance models. Of course, any of the columns and 
rows could be used to generate the performance model. The question as to how to 
select which rows and columns to measure as well as which additional points to 
consider such that the best models can be generated with the smallest cost is still 
open. While selecting the combination with the smallest cost is appealing, we must 
quantify how the quality of the models degrades compared to other strategies. For 
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Fig. 4 Example set of performance measurements for a two-parameter analysis. The filled circles 
represent a subset that is likely to be sufficient to create a performance model 


this purpose, we leverage reinforcement learning to compare and evaluate different 
strategies. 

Using this approach on some of our existing case studies shows very promising 
preliminary results: we were able to reduce the average modeling costs by up to 93% 
while maintaining 99% of the model accuracy. We are currently in the process of 
analyzing the limits of this approach and trying to define if there are any conditions 
required for it to be successful, or if it is a valid solution for most applications. 


6.2 Taint Analysis for Many-Parameter Modeling 


The current workflow of Extra-P follows three major steps: parameter identification, 
designing a set of experiments to measure the influence of changes in the parameters 
of interest, and estimating the best model from the provided data. A lot of work in 
Catwalk and ExtraPeak focused on automating and improving the third step in this 
process, leaving the user to still identify parameters of interest and choose which 
values to give these parameters for the subsequent measurements that will serve as 
input to Extra-P. 

We are prototyping a tool called perf-taint to alleviate this issue. Perf-taint is 
an LLVM-based hybrid program analysis integrated with Extra-P that will supply 
program information to the modeling process. We use taint analysis [14], a computer 
security technique which reliably relates marked input values with the program parts 
they potentially affect to determine which parameters influence the performance 
critical control-flow in the program and detects functions that are constant with 
respect to selected performance parameters. 

We have preliminary results using two benchmarks: LULESH and su3_rmd from 
the MILC suite. In LULESH, we consider three parameters and prune 303 out 
of 347 functions. In su3_rmd, we consider four parameters and prune 364 out of 
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621 functions. The instrumentation overhead is decreased 45 times in the case of 
LULESH compared to a full instrumentation. While these massive improvements 
in runtime overhead are likely specific to object-oriented C++ applications, the 
quality of the resulting models is increased across all our experiments. We hope this 
approach will allow better models to be generated with even less effort even for 
large, complex applications where the importance of individual parameters is not 
necessarily well understood. 


7 Related Work 


Performance analysis and prediction of real-world application workloads is most 
important in high-performance computing. Performance tools such as Score-P [1] 
allow the programmer to observe the performance of real-world applications at 
impressive scales but are often limited to observations of the current configuration 
and do not provide insight into their behavior when being scaled further. 

Such insights can be obtained with the help of analytical performance models, 
which have a long history. Early manual models showed to be very effective in 
describing application performance characteristics [27] and understanding complex 
behaviors [33]. Hoefler et al. established a simple six-step process to guide the 
(manual) creation of analytical performance models [21]. The resulting models 
lead to interesting insights into application behavior at scale and on unknown 
systems [2]. The six-step process formed the blueprint of our own approach. 

Various automated performance modeling methods exist. Tools such as 
PALM [43] use extensive and detailed per-function measurements to build structural 
performance models of applications. The creation of structural models is also 
supported by dedicated languages such as Aspen [42]. These methods are powerful 
but require the prior manual annotation of the source code. 

Hammer et al. combine static source-code analysis with cache-access simulations 
to create ECM and roofline models of steady-state loop kernels [19]. While their 
approach uses hardware information gathered on the target machine, it does not 
actually run the code but relies on static information instead. Lo et al. create 
roofline models for entire applications automatically and attempt to identify the 
optimal configuration to run an application on a given system [29]. Extra-P, in 
contrast, identifies scalability bugs in individual parts of an application rather than 
determining the optimal runtime configuration on a particular system. 

Vuduc et al. propose a method of selecting the best implementation for a given 
algorithm by automatically generating a large number of candidates for a selected 
kernel and then choosing the one offering the best performance according to the 
results of an empirical search [45]. Our approach generates performance models for 
all kernels in a given application to channel the optimization efforts to where they 
will be most effective. Zaparanakus et al. analyze and group loops and repetitions 
in applications towards automatically creating performance profiles for sequential 
algorithms [49]. Goldsmith et al. use clustering and linear regression analysis to 
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derive performance model coefficients from empirical measurements [16]. This 
approach requires the user to define either a linear or power law expectation for the 
performance model unlike the greater freedom offered by the performance model 
normal form defined in our approach. Jayakumar et al. predict runtimes of entire 
applications automatically using machine-learning approaches [25]. 

Zhai, Chen, and Zheng extrapolate single-node performance of applications with 
a known regular structure to complex parallel machines via simulation [50], but 
require the entire memory that would be needed at the target scale to correctly 
extrapolate performance. Wu and Miiller [48] showed how to predict the communi- 
cation behavior of stencil codes at larger scales by extrapolating their traces. While 
still requiring an SPMD-style parallel execution paradigm, Extra-P has proven to 
work with general OpenMP or MPI codes beyond pure stencil codes. 

Carrington et al. introduced a model-based performance prediction framework 
for applications on different computers [13]. Marin and Mellor-Crummey utilize 
semi-automatically derived performance models to predict performance on different 
architectures [30]. Siegmund et al. analyze the interaction of different configuration 
options and model how this affects the performance of an application as a whole 
rather than looking at its individual components [18, 41]. 

Reducing the burden of collecting the measurements required for the empirical 
learning process is a research effort in its own right. Sarkar et al. [36] suggest 
a powerful sampling approach which can be used if all features of interest are 
boolean. Another approach for sampling highly configurable systems with boolean 
configuration options by Zhang et al. [51] suggests using the Fourier transform 
to select the best samples. However, these methods cannot be directly adapted to 
our use case: the features modeled by Extra-P are allowed a much wider range 
of expression. They can be not just boolean, but functions with polynomial and 
logarithmic terms. 


8 Conclusion 


In the Catwalk project, we initially set out to prove that automated performance 
modeling is feasible and that automatically generated models are accurate enough 
to identify scalability bugs. We started by showing that in those cases where 
hand-crafted models existed in the literature our models are competitive. Our 
interaction with many different users from different fields taught us that approximate 
models are acceptable as long as the effort to create them is low and they do not 
mislead. Furthermore, being able to produce many performance models cheaply 
helps drastically improve code coverage, which is as important as model accuracy. 
Having approximate models for all parts of the code can be more useful than having 
a model with 100% accuracy for just a tiny portion of the code or no model at all. 
Finally, after the public release of the Extra-P software and numerous tutorials 
where Extra-P was introduced, we have seen growing interest from HPC application 
developers—whether for immediate use or in incorporating Extra-P in their own 
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research. The continuous development of Extra-P during the ExtraPeak project 
was partly driven by feature requests from the users themselves, and while the 
collaboration uncovered many challenges, the results invariably proved useful 
beyond the problem they were specifically developed to solve: The capability of 
modeling the impact of multiple parameters simultaneously paved the way for 
complex approaches such as using application-centric requirements in the co-design 
process or determining the iso-efficiency of task-based parallel applications. We 
confidently claim that Extra-P is a powerful tool capable of providing insightful 
performance information for most developers while requiring only a modicum of 
experience in performance analysis and few resources. 
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Abstract The FFMK project designs, builds and evaluates a system-software 
architecture to address the challenges expected in Exascale systems. In particular, 
these challenges include performance losses caused by the much larger impact of 
runtime variability within applications, hardware, and operating system (OS), as 
well as increased vulnerability to failures. The FFMK OS platform is built upon a 
multi-kernel architecture, which combines the L4Re microkernel and a virtualized 
Linux kernel into a noise-free, yet feature-rich execution environment. It further 
includes global, distributed platform management and system-level optimization 
services that transparently minimize checkpoint/restart overhead for applications. 
The project also researched algorithms to make collective operations fault tolerant 
in presence of failing nodes. In this paper, we describe the basic components, 
algorithms, and services we developed in Phase 2 of the project. 


1 Introduction 


The operating system (OS) abstracts from low-level aspects of a computer system’s 
hardware by providing applications with standardized programming interfaces 
and common services such as file systems and network access. By design, it 
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stands between the hardware and all applications. In the high-performance comput- 
ing (HPC) community, the OS is therefore sometimes considered to be “in the way” 
as applications try to extract maximum performance from the underlying hardware. 
Indeed, the OS can introduce overhead, as we will discuss in the following. 
But challenges posed by upcoming Exascale systems such as load imbalances or 
failures due to increasing component counts can benefit from system-level support. 
Therefore, the central goal of the FFMK project has been to investigate how the OS 
can actually help, rather than be a source of overhead. 

In the following paragraphs, we summarize the general architecture of the 
FFMK OS platform and give an overview of its higher-level services. In part, this 
description is an overview of results from Phase 1 of the project; but it shall also 
help put the results presented in this paper into context. For a much more detailed 
discussion of FFMK and the motivation behind it, we refer to our previous project 
report [59]. 


Multi-Kernel Node OS Figure 1 shows the architecture of the FFMK node OS. 
It is built on a multi-kernel foundation comprising an L4 microkernel and a 
variant of the Linux kernel that is called L Linux. We aim to support unmodified 
HPC applications and they shall have access to the same runtime libraries and 
communication drivers that are used on standard Linux-based HPC OSes. We 
target noise-sensitive applications by providing jitter-free execution directly on 
top our microkernel [33]. In this context, we also investigated the influence of 
hardware performance variation [61]. However, our vision for an HPC OS platform 
also includes new platform management services to support more complex and 
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Fig. 1 FFMK software architecture: compute processes with performance-critical parts of (MPT) 
runtime and communication driver execute directly on L4 microkernel; functionality that is not 
critical for performance is offloaded to the L*Linux kernel, which also hosts global platform 
management and fault-tolerance services 
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dynamic applications, as well as algorithms and system-level support to address 
fault-tolerance challenges posed by Exascale systems with unprecedented hardware 
component counts. 


Applications, Runtimes, Communication HPC applications are highly special- 
ized, but they achieve a certain level of platform independence by using common 
runtime and communication APIs such as the Message Passing Interface (MPI) [18]. 
However, just proving an MPI library and interconnect drivers (e.g., for InfiniBand) 
is not sufficient [60], because the majority of HPC codes use many Linux-specific 
APIs, too. The same is true for most HPC infrastructure, including parallel file 
systems and cluster management solutions. Compatibility to Linux is therefore 
essential and this can only be achieved, if applications are in fact compiled for Linux 
and started as Linux processes. 


Dynamic Platform Management The FFMK OS platform is more than a multi- 
kernel architecture. As motivated in the Phase 1 report [59], we include distributed 
global management, because the system software is best suited to monitor health 
and load of nodes. This is in contrast to the current way of operating HPC 
clusters and supercomputers, where load balancing problems and fault tolerance 
are tasks that practically every application deals with on its own. In the presence 
of frequent component failures, hardware heterogeneity, and dynamic resource 
demands, applications can no longer assume that compute resources are assigned 
statically. 


Load Balancing We aim to shift more coordination and decision making into the 
system layer. In the FFMK OS, the necessary monitoring and decision making is 
done at three levels: (1) on each node, (2) per application instance across multiple 
nodes, and (3) based on a global view by redundant master management nodes. We 
published fault-tolerant gossip algorithms [3] suitable for inter-node information 
dissemination and found that they have negligible performance overhead [36]. We 
further achieved promising results with regard to oversubscribing of cores, which 
can improve throughput for some applications [62]. We have since integrated the 
gossip algorithm, a per-node monitoring daemon, and a distributed decision making 
algorithm aimed at automatic, process-level load balancing for oversubscribed 
nodes. However, one key component of this platform management service is still 
missing: the ability to migrate processes from overloaded nodes to ones that have 
spare CPU cycles. Transparent migration of MPI processes that directly access 
InfiniBand hardware has proven to be extremely difficult. We leave this aspect for a 
future publication, but do we do summarize our key results on novel diffusion-based 
load balancing algorithms [39] in this report. These algorithms could be integrated 
into the FFMK load management service once process-level migration is possible. 


Fault Tolerance The ability to migrate processes away from failing nodes can also 
be used for proactive fault tolerance. However, the focus of our research on system- 
level fault tolerance has been in two other areas. First, we published on efficient 
collective operations in the presence of failures [27, 31, 43]. Second, we continued 
research on scalable checkpointing, where we concentrated on global coordination 
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for user-level checkpointing [22] and how to optimize it based on expected failure 
rates [21]. 


In the following two sections of this paper, we discuss how re-architecting the 
OS kernel for HPC systems can improve performance and scalability (Sect. 2); we 
also present results on how to increase kernel scalability beyond the dozens of cores 
found in contemporary systems, as well as new load balancing algorithms. We then 
make the case that system-software support is essential to address fault-tolerance 
challenges posed by Exascale systems and how new fault-tolerant algorithms can 
help improve robustness and performance of collective operations (Sect. 3). Each 
individual subsection on these pages summarizes our peer-reviewed work in the 
respective area. 


2 Building Blocks for a Scalable HPC Operating System 


2.1 The Case for a Multi-Kernel Operating System 


Noise-Sensitive Applications A widely reported problem in HPC is “OS noise” [5, 
16, 26, 49, 53], where sporadic or periodic housekeeping activities of the OS (or 
other background tasks) briefly interrupt application threads. These interruptions 
can slow down applications based on the bulk-synchronous programming (BSP) 
model. BSP applications are parallel programs that are characterized by alternating 
computation and communication phases that all participating threads must perform 
in perfect synchronization to maximize throughput. If just a few compute threads 
are preempted by background activities, all other threads that depend on their input 
will waste CPU cycles as they wait for the stragglers to reach the communication 
phase. As shown in Fig. 2, delays can amount to hundreds of thousands of cycles, 
resulting in a slowdown of up to 9% for a computation that takes 1.5 ms to complete 
when there is no interruption. 


+9% 


a +450000 z 
L +400000 +8% 
© +350000 tT% 0 
> +300000 +6% © 
®% +250000 +5% 3 
xæ +200000 +4% o 
S +150000 +3% Z 
= +100000 +2% 5 
S  +50000 +1% © 
< +0 +0% 
O 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 


Rounds 


Fig. 2 OS noise during a run of the fixed work quantum (FWQ) benchmark on a node of a 
production HPC cluster with Linux-based vendor OS 
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Re-Architecting the Kernel for HPC One way to address the OS noise problem 
is to partition the compute resources of each node into two sets of cores: compute 
cores that are allocated exclusively to HPC applications and service cores that 
run management and maintenance tasks. Multi-kernel OS architectures implement 
this approach by assigning the compute cores to a lightweight kernel (LWK); a 
traditional kernel such as Linux runs management and node monitoring daemons on 
the service cores. The LWK does not preempt compute threads, thereby minimizing 
execution-time jitter for applications. However, a LWK for HPC does not replace a 
complex kernel such as Linux. It only implements functionality that is critical for 
application performance; system calls that do not impact performance are offloaded 
to the Linux kernel running on the service cores. The multi-kernel approach gives 
HPC applications the best of both worlds: the LWK ensures low noise and high 
performance, whereas Linux offers convenience, familiar APIs, a rich feature set, 
and compatibility with huge amounts of legacy infrastructure. 


A Microkernel as a Universal LWK Multi-kernel OS architectures have received 
a lot of attention in recent years and several new LWKs have been developed, 
including IHK/McKernel [55] and mOS [63]. However, the maintenance effort for 
keeping an HPC-suitable multi-kernel OS up to date and compatible with Linux 
must be smaller than constantly patching mainline Linux to make it less noisy. 
Arguably, the best way to meet this requirement is to reuse components that already 
exist and that are actively maintained. In the FFMK project, we therefore chose 
to use the mature L4Re microkernel and LfLinux as the basis of the FFMK node 
OS. In contrast to McKernel and mOS, the L4Re microkernel manages not just 
the designated compute cores, but all processor resources. The L*Linux kernel runs 
virtualized on top of L4Re as shown in Fig. 1 on page 484. 


2.2 L4Re Microkernel and I4Linux 


In this Subsection, we quote! from a previous publication [60] an overview of 
the L4Re ecosystem, including the L4Re microkernel and the paravirtualized Linux 
kernel L*Linux. These two basic building blocks are combined into a foundation of 
a highly flexible and low-noise node OS. In Sect. 2.3, we evaluate the main benefits 
of this multi-kernel architecture. 


L4 Microkernel The L4Re microkernel is a member of the L4 family of micro- 
kernels. The core principle of L4 [40] is that the kernel should provide only the 
minimal amount of functionality that is necessary to build a complete OS on 
top of it. Thus, an L4 microkernel is not intended to be a minimized Unix, but 
instead it provides only a few basic abstractions: address spaces, threads, and inter- 


'This description has been shortened and slightly edited for brevity; see [60] for the complete 
version. 
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process communication (IPC). For performance reasons, a thread scheduler is also 
implemented within the kernel. However, other OS functionality such as device 
drivers, memory management, or file systems are provided by system services 
running as user-level programs on top of the microkernel. 


Applications and User-Level Services L4Re applications communicate with each 
other and with system services by exchanging IPC messages. These IPC messages 
can not only carry ordinary data, but they may also transfer access rights for 
resources. Being able to map memory pages via IPC allows any two programs 
to establish shared memory between their address spaces. Furthermore, because 
it is possible to revoke memory mappings at any time, this features enables user- 
level services to implement arbitrary memory-management policies. In much the 
same way an L4Re program can pass a capability referencing a resource to another 
application or service, thereby granting the receiver the permission to access that 
resource. A capability can refer to a kernel object such as a Thread or a Task, 
representing an independent flow of execution or an address space, respectively. But 
they may also point to an Ipc_ gate, which is a communication endpoint through 
which any user-space program can offer an arbitrary service to whomever possesses 
the corresponding capability. 


I/O Device Support An important feature of the L4Re microkernel is that it maps 
hardware interrupts to IPC messages. A thread running in user space can receive 
interrupts by waiting for messages from an Irq kernel object. In conjunction with 
the possibility to map I/O memory regions of hardware devices directly into user 
address spaces, it is possible to implement device drivers outside the microkernel. 


Virtualized Linux The L4Re microkernel is a fully functional hypervisor capable 
of hosting virtual machines running unmodified guest operating systems. It employs 
hardware-assisted virtualization on instruction set architectures that support it, 
including x86, ARM, and MIPS. Device emulation or passthrough is supported 
through virtual machine monitors running in user space. However, faithful virtu- 
alization is not the only way to run a legacy OS on top of the L4Re microkernel. 
L‘Linux is a paravirtualized Linux kernel that has been adapted to run on the 
interfaces provided by L4Re. It is binary compatible with standard Linux programs, 
however, instead of running in the privileged mode of the CPU, the L*Linux kernel 
runs as a multi-threaded user-level program. Linux user processes run in their own 
L4 tasks (i.e., other address spaces). Linux programs on L*Linux experience the 
same protection as on native Linux; they cannot read or write the Linux kernel’s 
memory and they are protected from each other like processes on native Linux. 


The vCPU Mechanism For execution, L Linux employs vCPUs, a mechanism 
provided by the microkernel that allows for an asynchronous execution model where 
a vCPU migrates between executing code in the L*Linux kernel and user code in 
Linux processes. For any event that needs to be handled by the Linux kernel, such 
as system calls and page faults by processes, or external interrupts by devices, the 
vCPU switches to the L Linux kernel to handle them. 
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L'Linux Process Model [*Linux manages the address spaces of Linux user 
processes through Task objects provided by the L4Re microkernel. Thus, every 
Linux process and the contents of its address space are known to the microkernel. 
Furthermore, L*Linux multiplexes all user-level threads executing in such an address 
space onto its vCPUs. Thus, the L4Re microkernel is involved in every context 
switch of any Linux user thread. In particular, it is responsible for forwarding any 
exceptions raised by a Linux program to the L*Linux kernel. Exceptions occur when 
a thread makes a system call, when a page fault occurs during its execution, or 
when a hardware device signals an interrupt. L+Linux receives these exceptions at 
a previously registered vCPU entry point, to which the microkernel switches the 
control flow when it migrates the vCPU from the Task of the faulting Linux user 
program to the address space of the virtualized Linux kernel. 


2.3 Decoupled Execution on Linux 


Since virtually all HPC codes are developed for Linux and require so many of its 
APIs, the only practical option is to execute them on a Linux-based OS. Just running 
them on L*Linux yields no benefit. However, the tight interaction between the L4Re 
microkernel and L*Linux allows us to conveniently implement a new mechanism 
we call decoupling [33]. 


Decoupling Thread Execution from L'Linux The purpose of decoupling is to 
separate execution of a thread in a Linux process from the vCPU it is normally 
running on. To this end, we create a separate, native L4Re host thread that runs in 
the same L4 Task (i.e., address space) as the Linux process, but not under control 
of L*Linux. The original Linux thread context in the L+Linux kernel is suspended 
while execution progresses on the native L4Re host thread. The user code running 
there will raise exceptions just as if it were executed by a vCPU, except that the 
microkernel forwards each of them to L*Linux as an exception IPC message. A 
message of this type carries a thread’s register state and fault information as its 
payload, and is delivered by the microkernel to an exception handler. We configure 
L*Linux to be the exception handler of the “decoupled” Linux user threads. An 
attempt to perform a Linux system call will also result in an exception, which the 
L*Linux kernel can then handle by briefly reactivating the previously suspended 
thread context and scheduling it onto a vCPU. Afterwards, execution is resumed in 
decoupled mode on the L4Re host thread. Figure 3 visualizes decoupled execution; 
more details can be found our publications [33, 60]. 


One Mechanism for the Best of Both Worlds The net gain of the decoupling 
mechanism is that we can combine noise-free execution on our LWK (i.e., the 
L4Re microkernel) with the rich execution environment of Linux, including all 
its APIs and the HPC infrastructure built for it. Decoupled threads use a single 
mechanism for forwarding any system call or exception, instead of many specialized 
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Fig. 3 Schematic view of the decoupling mechanism. The L4Re microkernel runs on every core 
of the system, while the virtualized L*Linux runs on a subset of those cores only. All normal Linux 
applications are thus restricted to those cores. Decoupling pulls off threads and runs them on cores 
not available to L+Linux 


proxies that other multi-kernel HPC OSes use and that are difficult to maintain [60]. 
Applications are built for Linux and start running as Linux processes, but we pull 
their threads out of Linux’s scheduling regime so they can run on dedicated cores 
without being disturbed by LLinux. Effectively, decoupled threads run directly on 
the microkernel. However, they can use all services provided by L+Linux, which 
will continue to handle Linux system calls and resolve page faults. Also, since the 
InfiniBand driver in the L*Linux kernel maps the I/O registers of the HCA into 
the address space of each MPI rank, the high performance and minimal latency of 
the user-space part of the driver is not impaired; a decoupled thread can program 
performance-critical operations just like it would on native Linux. 


CPU Resource Control The number of vCPUs assigned to L*Linux and their 
pinning to physical CPU cores determines how much hardware parallelism an 
L*Linux-based virtual machine can use. All other cores are exclusively under control 
of the L4Re microkernel and can therefore be allocated exclusively to decoupled 
threads of HPC application processes. 


Initial Benchmark We used the fixed-work quantum (FWQ) [35] benchmark to 
determine how much “OS noise” decoupled threads experience. FWQ executes a 
fixed amount of work in a loop, which on a perfectly noise-free system should 
require a constant amount of CPU cycles to complete. Figure 4 shows the result 
of our first benchmark run, performed on a 2-socket machine from our lab. Using 
the rdtsc instruction, we measured delays of up to 55 CPU cycles per iteration 
when FWQ is executed by a decoupled thread on a dedicated core. The execution- 
time jitter is reduced to 4 cycles per iteration, when FWQ is offloaded to the second 
socket, while L+Linux is pinned to a single core of socket 1. 
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Fig. 4 Minimal OS noise remaining in decoupled execution; L*Linux running on same socket 


Multi-Node Benchmark The execution-time jitter we measured in the lab is four 
to fives orders of magnitude smaller than what we saw when FWQ ran on the 
vendor-provided Linux OS of Taurus, an HPC cluster installed at TU Dresden (see 
Fig.2 on page 486). For larger-scale benchmarks, we extended FWQ into MPI- 
FWQ, a parallel benchmark that executes work-loop iterations in each MPI process 
on all participating cores. We performed experiments with our L4Re/I4Linux-based 
node OS on 50 Taurus nodes. Each node has two Xeon® E5-2690 processors 
with 8 cores per socket. To benchmark “decoupling” in a parallel application, 
we allocated one core to L*Linux and the remaining 15 cores to MPI-FWQ. The 
baseline we compare against is 15 MPI-FWQ processes per node scheduled by 
the same L*Linux on the same hardware, but in a 16-vCPU configuration with no 
decoupled threads. 


EP and BSP Runs MPI-FWQ can operate in two modes: In StartSync mode, a 
single barrier across all ranks is used to synchronize once when the benchmark 
starts; this mode simulates an embarrassingly parallel (EP) application. In StepSync 
mode, MPI-FWQ waits on a barrier after each iteration of the work loop, thereby 
acting like an application based on the bulk-synchronous programming (BSP) 
model. 


Figure 5 shows the time to completion for any MPI-FWQ process operating in 
BSP-style StepSync mode, as we increased the total number of MPI processes (i.e., 
cores) from 30 to 750. The benchmark run time with decoupled MPI-FWQ threads 
(L4Linux-DC) is approximately 1% shorter than when the standard scheduler in the 
L*Linux kernel controlled application threads (L4Linux-Std); results for EP-style 
StartSync runs show a similar performance. This difference is smaller than the up 
to 9% jitter we saw with the vendor OS, but our stripped down Linux environment 
lacks most of the management services and system daemons that run on the cluster. 
These services could never preempt decoupled threads, though. 

More information on the decoupling mechanism, additional use-cases, and 
evaluations results can be found in separate publications [32-34]. 
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Fig. 5 BSP-style MPI-FWQ (StepSync mode) on L*Linux (Std) and with decoupled thread 
execution (DC) on Taurus. This figure has originally been published in [33] 


2.4 Hardware Performance Variation 


Software-induced imbalance in large-scale parallel simulations have also been 
studied by other groups, who proposed multi-kernel architectures as well [20, 23, 
30, 48, 50, 51]. However, one source of variability has not been systematically 
investigated so far: the hardware itself. Hardware performance variation is partic- 
ularly interesting due to growing diversity of platforms in HPC and the increasing 
complexity of computer architectures in general. 


Measuring Hardware Performance Variation Characterizing hardware perfor- 
mance variability is challenging, because it requires a tightly controlled software 
environment. LWKs [51] have the greatest potential to obtain a precise charac- 
terization of various aspects of hardware performance variability on real HPC 
hardware. Towards this end, we have developed an extensible benchmarking 
framework to systematically characterize different aspects of hardware performance 
variability [61]. We use this benchmark suite to analyze five platforms described 
in Table 1: Intel Xeon [29], Intel Xeon Phi [56], Cavium ThunderX [9], Fujitsu 
FX100 [64] and IBM BlueGene/Q [28]. To minimize “OS noise”, we benchmarked 
on OSes based on two LWKs: CNK on the IBM BG/Q and IHK/McKernel [20, 55] 
on the Intel, Fujitsu, and Cavium machines. 


Benchmark Suite In addition to the previously described FWQ benchmark, our 
benchmark suite consists of seven other benchmark kernels. They were selected 
from well-known algorithms, micro benchmarks, or proxy applications and have 
the following characteristics: 


¢ The DGEMM benchmark performs matrix multiplication. We confine ourselves 
to naive matrix multiplication algorithms to allow compilers to emit SIMD 
instructions, if possible. This benchmark kernel is intended to measure hardware 
performance variation for double-precision floating point and vector operations. 
The SHA algorithm utilizes integer execution units instead. 
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Using John McCalpin’s STREAM benchmark, we assess variability in the cache 
and memory subsystems. The Capacity benchmark is intended to measure 
performance variation of cache misses themselves. 

HACCmk from the CORAL benchmark suite is a compute-intensive N-body 
simulation kernel with regular memory accesses. HPCCG from Mantevo’s 
benchmark suite is a Mini-App aimed at exhibiting the performance properties 
of real-world physics codes working on unstructured grid problems. MiniFE 
is another proxy application for unstructured implicit finite-element codes from 
Mantevo’s package. 


More detailed descriptions of these benchmark kernels, modifications we made, and 
all measurement results can be found in our paper [61]. In this report, we highlight 
only a few key results from FWQ, HPCCG, and DGEMM experiments. 


Workload Matters Like previous studies on software-induced performance vari- 
ation, we relied on the FWQ benchmark to evaluate execution-time jitter for 
decoupled threads. However, this simple benchmark kernel, may be suitable to 
quantify interruptions caused by system software, but they are insufficient to 
capture hardware-induced noise. We hypothesize that the full extent of hardware 
performance variation can only be observed when the resources which cause these 
variations are actually used. And indeed, as shown in Fig.6, the HPCCG proxy 
code running on IHK/McKernel on an Intel Ivy Bridge system shows about 1% of 
performance variation among all cores of a node. 


We measured variation on 30 SMT cores of this 2-socket Intel Ivy Bridge E5- 
2650 v2 system for both FWQ and HPCCG. We set the working set size of HPCCG 
to 70% of the L1 data cache size (32 KiB), disabled TurboBoost, set the scaling 
governor to performance, and fixed the clock speed to the nominal frequency of 
2.6GHz. We additionally sampled the performance counters for L1 data cache 
and L1 instruction cache misses and confirmed that both benchmarks experience 
little to no misses, in particular cores one to seven and 16 to 29 experience neither 
instruction cache nor data cache misses under HPCCG. Nevertheless all cores show 
significantly more variation under HPCCG than under FWQ. We conclude that 
FWQ is indeed ill-suited to measure hardware-induced performance variation. 
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Fig. 6 Performance variation of FWQ and HPCCG on a dual-socket Intel E5-2650 v2 
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Fig. 7 Hardware performance variation under the DGEMM benchmark 


Microarchitectures Differ Figure 7 visualizes our measurements for the DGEMM 
benchmark, which is dominated by floating point operations. We observe that the 
FX and ThunderX platforms exhibit very low variation, and note the rather high 
variation of the Ivy Bridge, KNL and BlueGene/Q platforms. We saw high numbers 
of cache misses on the Ivy Bridge platforms and therefore reduced the cache 
pressure to 70% fill level. After this modification to the benchmark setup, we saw 
stable or even zero cache miss numbers for all cores of the Ivy Bridge platform, but 
variation did not improve. We conclude that the measured variation is not caused by 
the memory subsystem. 


Overall we found that just focusing on CPU core-local resources, like we did 
in this study, already shows up to six orders of magnitude difference in relative 
variation among different CPU architectures. 


2.5 Scalable Diffusion-Based Load Balancing 


A low-noise execution environment is essential for certain types of applications, 
as we explained in the preceding subsections. However, there are other HPC codes 
where noise is less of a concern, because they suffer from load imbalances that are 
inherent to the problem domain. For example, they might use multiple simulation 
kernels with different computational complexity. Furthermore, some computations 
are difficult to parallelize efficiently, because partitioning of the problem space 
is non-trivial; it might even change dynamically as the simulation progresses. As 
mentioned as part of the architecture overview in Sect. 1, we believe that support for 
load balancing should be provided at the system level, thereby freeing application 
developers from the burden of managing a dynamic system. 


Taskifying MPI Applications may need to be (re-)balanced due to inherent load 
imbalances or to support shrinking and expanding the set of nodes assigned to 
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the application. In the Phase 1 report [59] of this project, we proposed to taskify 
MPI processes using oversubscription, which results in multiple Tasks (i.e., MPI 
processes) per core that can be migrated transparently without application code 
modifications. MosiX-like algorithms [2] can be used for this kind of system-level 
load balancing in case communication between application tasks is insignificant. In 
other cases, repartitioning methods that consider the task communication graph are 
required [57]. 


Requirements for Efficient Load Balancing In the context of our system architec- 
ture, the requirements for a load-balancing method are: (1) effective load balancing 
with low inter-node communication (edge-cut), (2) low amount of migration to 
reduce MPI process migration costs, and (3) low overhead of the method itself. The 
method’s input should be the local node’s view of the weighted task communication 
graph, which can be obtained by monitoring the application at the OS/MPI level. 
Since graph partitioners, like ParMetis [52], are known to be computationally 
expensive, we developed a nearest-neighbor diffusion-based repartitioning method. 


Method Description The method consists of four main phases that require point- 
to-point communication between neighbor nodes only and, with the exception of 
the flow calculation, have O(number of tasks per node) computational complexity. 


1. Each node determines its neighbor nodes from the task communication graph. 

2. Flow calculation: Computation of minimal load flows between neighbor nodes 
that lead to global balance using 2nd-order diffusion [13, 45]. The diffusion is 
stopped between each node pair individually if the load flow of an iteration falls 
below a specified threshold. The required number of iterations grows with the 
number of nodes. However, with low-latency networks the observed run time is 
within the low millisecond range even with thousands of nodes [39]. 

3. Task selection: Tasks at the partition border are selected for migration to realize 
workload flows using different weighted criteria to achieve low edge-cut [14]. 

4. Partition refinement: Edge-cut is improved with a parallel KL/FM-based 
refinement algorithm [58] that smoothes partition borders by swapping weighted 
tasks between neighbor node pairs independently within a certain imbalance tol- 
erance. If the task selection result was not within the tolerance, the optimization 
goal is “balance” instead of “edge-cut’. 


Evaluation Workloads We implemented the diffusion method within the Zoltan 
load balancing library [12] and evaluate the performance on Taurus in a normal, 
non-oversubscribed setup. MPI processes own multiple migratable user-level tasks. 
Two scenarios with 3D task meshes are used for performance evaluation: the Cloud 
scenario consists of computation time measurements from 36 x 36 x 48 grid cells 
over 100 time steps of COSMO-SPECS+FD4 [38] and the synthetic Shock scenario 
simulates the evolution of a spherical shock wave over a 160 x 80 x 80 grid with a 
four times increased workload at the wave front over 169 time steps. Figure 8 shows 
the workload distribution for a 2D version of the shock scenario with two examples 
of resulting partitionings. 
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Fig. 8 Left: workload distribution for a selected time step of the 2D shock scenario, red indicates 
4 times increased workload; Center: Resulting partitioning for 96 processes with ParMetis, colors 
represent individual partitions; Right: Resulting partitioning with Diffusion 


Diffusion Results We compare our method with four other (re)partitioning meth- 
ods: a hierarchical space-filling-curve method (FD4/SFC) [37], recursive coordinate 
bisection and space-filling curve from Zoltan (Zoltan/RCB and SFC), as well as 
AdaptiveRepart from ParMetis [52] with 0.5% imbalance tolerance. Note, that 
SFC and RCB are coordinate-based and not graph-based; they require application 
knowledge about spatial coordinates of tasks. Since diffusive load balancing 
requires a partitioning to exist, we use Zoltan/RCB to initialize the partitioning. 
Figure 9 shows the results for different metrics (lower is better), each averaged over 
the time steps (except median for run time): 


¢ Load Imbalance (max. load among processes/average load—I): Diffusion per- 
forms better than ParMetis, but worse than the three geometric methods. 

¢ Migration (max. no. of tasks a process imports and exports/avg. no. of tasks per 
process): Diffusion outperforms the other methods, especially with the Shock 
scenario (factor 3—10 less migration). 


Cloud Scenario Load imbalance Migration Edge-cut Run time (ms) 
0 4% 8% 0 40% 80% 0 50% 100% 0.1 1 10 100 


Number of procs: 


FD4/SFC 
Zoltan/SFC 
Zoltan/RCB 
ParMetis 
Diff tol=1% 
Diff tol=10% 


Shock Scenario Load imbalance Migration Edge-cut Run time (ms) 

0 3% 6% 0 100% 200% 0 40% 80% 0.1 1 10 100 1000 
FD4/SFC Nu ope 
Zoltan/SFC 
Zoltan/RCB 
ParMetis 
Diff tol=1% 
Diff tol=10% 


Fig. 9 Evaluation of diffusion-based load balancing with Cloud (62208 tasks) and Shock (ca. 1M 
tasks) scenarios. Bar shades show results for 96-768 and 384—6144 MPI processes, respectively 
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e Edge-cut (max. edge-cut among all processes/avg. number of edges per process): 
All methods achieve very similar results, except diffusion with low imbalance 
tolerance at high process counts (coarse granularity). 

e Run time of the method (max. among all processes): Due to its scalability, 
diffusion is clearly faster than the Zoltan methods at higher parallelism and 1- 
2 orders of magnitude faster than ParMetis. At 6144 processes, it requires 2.5 ms 
only. 


As can be seen, the imbalance tolerance of the diffusion method allows to 
trade-off load balance vs. edge-cut and migration. We can conclude that our load 
balancing method enables fast, scalable, and low-migration graph-based repartition- 
ing. 


2.6 Beyond L4: Improving Scalability with M? 
and SEMPEROS 


We conclude the performance and scalability section of this paper with an out- 
look on a new kernel architecture suitable for heterogeneous many-core systems: 
SEMPEROS [25]. 


M? SEMPEROS is based on M? [1], a hardware/software co-designed system archi- 
tecture. Like L4Re, which we described in Sect. 2.2, these systems are microkernel 
OSes that manage access rights to all system resources based on capabilities. A 
process can only use a resource, if it owns a capability to it. Such resources include 
threads, memory allocations, and files, but they are also used to grant and revoke 
access to CPU cores and the ability to send messages between the threads running 
on these cores. A key aspect of the M? design is that the OS kernel runs on just one 
core, which remotely manages all other cores? in the system. 


Heterogeneous Architectures Since it is not necessary that all cores of a system 
can run an OS kernel, M? is suitable for heterogeneous system architectures with 
different kinds of CPU cores. The current No. 1 system in the TOP500 list of 
supercomputers, China’s Sunway TaihuLight system [19], is based on such an 
architecture. Each node has “big” cores capable of running an OS kernel, as well 
as many small compute cores that are optimized for computation, but lack the 
architectural support for an OS kernel. 


SEMPEROS To put hundreds of cores under the control of a microkernel OS, the 
kernel and its capability system need to scale. SEMPEROS extends M? to manage the 


?M? provides a hardware abstraction to integrate accelerators in the same way as general-purpose 


cores. Therefore, we usually use the term processing element in an M? context. HPC workloads can 
benefit from generalized accelerator support, but it does not influence how the capability system 
works. In this paper, we therefore use the common term cores to simplify the explanation. 
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system using multiple kernels. The compute cores of the system are organized into 
as many partitions as there are kernel instances, thereby increasing the total number 
of cores (and compute threads) the system can handle. Each of the SEMPEROS 
kernel instances executes on its own privileged core, but they coordinate with each 
other via a distributed capability protocol [25]. The collaboration between cores 
managed by different kernels is transparent to the applications. 


Capability Model SEMPEROS implements partitioned capabilities. The capabili- 
ties are stored within a protected address space and the kernel supervises capability 
operations. Application processes can delegate and revoke capabilities via the 
respective system calls. The kernel records the delegations in a capability tree that 
stores the relations between capabilities. Using this capability tree, SEMPEROS 
implements recursive revocation, by which all access rights that originated from the 
specific capability can be revoked by deleting this capability and all its descendants. 


Distributed Capability Protocol In a distributed capability system, the data 
structure storing the capability tree is split between multiple kernels. Because the 
same capabilities can be manipulated simultaneously by different kernels in the 
system, SEMPEROS kernel instances need to coordinate certain capability opera- 
tions. We analyzed all possible interleavings of capability operations (e.g., during 
delegation or revocation) and developed a protocol that prevents inconsistencies in 
the capability tree. This protocol integrates a confirmation for capability delegation 
(similar to a two-way handshake) and a distributed mark-and-sweep algorithm to 
revoke capabilities [25]. 


Evaluation Setup We evaluated SEMPEROS using the cycle-accurate gem5 sim- 
ulator [6]. The experiments were run with up to 640 out-of-order cores integrated 
into a single network-on-chip. The applications used in our evaluation stress the 
capability system, as they extensively use OS services. In particular, they make 
heavy use of the M?/SEMPEROS in-memory file system, which grants access to 
memory ranges within files by delegating memory capabilities to applications and 
revoking those capabilities when files are closed again. We assume that similar usage 
patterns would also occur during checkpointing operations on a future HPC system 
with storage-class memory in each node. 


Scalability Results To quantify the scalability of SEMPEROS, we computed the 
parallel efficiency, exposing the runtime overhead a single benchmark instance (i.e. 
process) experiences when it is executed in parallel with a number of identical 
benchmark instances. Figure 10a depicts an overview of the parallel efficiency of 
all application benchmarks we evaluated. The scalability of the applications improve 
when increasing the number of kernels managing the system as depicted in Fig. 10b. 


The capability protocol in general is applicable to any distributed capability 
system implementing global IDs and stores all capability relations in a capability 
tree. 
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Fig. 10 Scalability evaluation of SEMPEROS. (a) Parallel efficiency of application benchmarks. 
(b) Level DB key-value store 


3 Algorithms and System Support for Fault Tolerance 


In this section we discuss FFMK research results on fault tolerance. Our activities 
concentrated on two areas: First, we introduce new methods for resilience of MPI 
applications at scale. To this end, we developed probabilistic and deterministic 
algorithms for resilience of collective operations (Sects. 3.1 through 3.3). Second, 
we present new approaches to coordinate and optimize concurrent checkpointing 
of multiple jobs (Sect. 3.4) and to improve multi-level checkpointing for a single 
parallel job (Sect. 3.5). 


3.1 Resilience in MPI Collective Operations 


MPI libraries are usually not fault tolerant and therefore not able to complete a 
collective operation correctly in case of a fault. Instead, the whole application will 
either hang or crash. Unfortunately, the overall system reliability decreases as the 
number of processes involved in the computation grows. 


Existing Recovery Approaches The MPI standard does not address resilience 
of parallel applications, with the exception of return codes for detected errors. 
Currently, there are multiple outstanding proposals how to address faults during MPI 
run-time. Most work on this subject is characterized as either backward recovery or 
forward recovery. The former tries to restore a correct past state, whereas the latter 
attempts to establish a new correct state. For example, ULFM [8] applies forward 
recovery, entailing a set of tools that allow applications to deal with detected faults. 


Fault-Tolerant Algorithms Fault-Tolerant Collective Operations (FTCO) [43] is 
a new forward recovery approach. FTCO relies on parallel algorithms that apply to 
tree-based collective operations in MPI. It includes resilient versions of collective 
operations such as Bcast, Reduce and Allreduce, for any tree topology. This 
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Fig. 11 Message flow after 
DATA timeout: Node #3 
overcomes fault detected on 
node #1 by bypassing it and 
sending the data to the next 
node #0 along the tree 


algorithm detects faults by a per-node calculated timeout and overcomes faults by 
excluding failed processes. It is intended for use-cases that can tolerate process 
faults, such as Monte-Carlo method or PDE solvers. FTCO delivers a result to every 
live process, so that the application may keep running without handling those faults. 
The FTCO algorithm minimizes the fault-free performance penalty and shows a 
small increase in latency and messages per fault, regardless of job size. This offers a 
transparent and scalable forward recovery alternative to the costly legacy backward 
recovery mechanisms. For example, Fig. 11 demonstrates a simple case, where node 
#3 overcomes the fault detected on node #1 by bypassing it and sending the data to 
the next node (#0) along the tree. 


FTCO Results FTCO differs from other approaches, such as ULFM, by localizing 
the detection and recovery of faults, while other approaches involve the entire group 
of processes in the MPI job. Our experimental results with the FTCO approach 
support this claim, showing that the latency with FTCO is proportional to the 
number of serial faults, regardless of the number of MPI proceeses: Table 2 shows 
the FTCO latency of the Allreduce operation for different combinations of offline 
faults and increasing number of processes. We chose a timeout of 2 s, which is the 
default for ULFM [43]. In the table, each figure represents the average longest time 
(in seconds) for a process to complete the same Allreduce call. In a tree topology, 
multiple faults could be either serial or parallel: parallel faults occur in different sub- 
trees, thus their recovery time may overlap, while serial faults are handled one after 
the other. We found that two parallel faults take approximately the same amount 
of time as a single fault; adding a third parallel fault to two serial faults does not 
change the latency. 


Table 2 FTCO: average longest time in seconds for a process to complete Allreduce call 


No. of process No faults | 1 fault |2 parallel faults | 2 serial faults | 3 mixed faults 


64 0.0005 2.0095 | 2.0094 4.0194 4.0194 
128 0.0006 2.0123 | 2.0123 4.0221 4.0222 
256 0.0007 2.0121 | 2.0123 4.0220 4.0219 
384 0.0017 2.0114 | 2.0323 4.0404 4.0217 


512 | 0.0024 2.0343 | 2.0110 4.0421 4.0422 
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Fig. 12 The latency cost of one fault (in seconds) with FTCO and ULFM: FTCO outperforms 
ULFM; both approaches use the same fault-detection timeouts and tree topology 


FTCO vs ULFM Figure 12 shows a comparison between the latency of FTCO and 
ULFM for the Allreduce operation with a single fault and an increasing number of 
processes. In both cases we used the same parameters, like fault detection timeouts 
and tree topology. Each test started by injecting a fault into one process and then 
attempted an Allreduce on multiple communicators with that process. 


Integration of FTCO into UCX We developed a prototype library to further 
measure the performance of FTCO based on UCX. UCX [54] is an open-source 
point-to-point communication library, optimized for performance on low-latency 
interconnects. UCX is designed to provide a high-level of abstraction for com- 
munication, and to consider the specifics of the local NIC, to find the optimal 
send and receive methods. UCX queries the capabilities of each NIC to establish 
which hardware accelerations are present, and chooses the optimal parameters 
for sending messages, including the NIC, port and protocol (e.g. rendezvous). In 
order to apply fault-tolerance algorithms to MPI applications, we extended UCX 
with an implementation of MPI’s collective operations. Our library is suitable 
for applications that can withstand partial failures, where the overall result is not 
effected by some faults, or where the application can overcome them. 


Our extension contains a basic, deterministic implementation for fault-free 
collective operations. It demonstrated the benefit of a persistent collective operation: 
applications often repeat the same collective operation call, and so the library can 
reuse past structures instead of creating it from scratch for each call. Our UCX- 
based library can be used with any MPI implementation, and provides a foundation 
for further experimental results and optimizations. 
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3.2 Corrected Gossip Algorithms 


An alternative to the deterministic approach taken with FTCO is to use randomized 
gossip-based algorithms, which have been shown to yield better recovery latency 
for some applications. Gossip algorithms have been widely successful in various 
contexts that did not require strong consistency. Yet, they become rapidly inefficient 
once about 50% of the nodes were reached, because messages are more likely to be 
sent to nodes that have already received a message before. 


Gossip-Based Broadcast Let us consider how broadcast among processes can be 
implemented using gossip. The root sends the broadcast payload to a random subset 
of other processes. When a new process receives the payload and gets colored, it 
starts sending messages to randomly selected processes as well. This dissemination 
runs for a fixed period of time, after which all colored processes have received the 
broadcast payload. 


Closing the Gaps The gossip phase attempts to color as many processes as 
possible, but due to potential failures and the random nature of gossip, the protocol 
cannot guarantee that all live processes are actually colored. Our new algorithm 
therefore enters a deterministic correction phase, in which it tries to color these 
remaining processes. For correction, all processes are reorganized into a virtual ring 
(e.g., according to their MPI rank numbers from 0 to P — 1 for P processes). On this 
ring, uncolored processes create gaps, where the maximum gap size is the length 
of the longest sequence of uncolored processes. All colored processes now send 
messages to their neighbors on the ring, thereby closing the gaps with few messages 
per node. 


Corrected Gossip Algorithms We designed three different protocols based on 
this idea of combining randomized and deterministic algorithms for improving the 
broadcast latency [27]. These three algorithms allow us to choose various tradeoffs 
between consistency, simplicity, and performance. The three algorithms are: (1) 
opportunistic, which applies the correction without checking for completion; (2) 
checked, which runs the correction until all nodes received the message, provided 
that no nodes fail during the correction; and (3) fail-proof, which applies the 
correction and guarantees that all active nodes receive the message, provided 
that no more than f nodes fail during the operation. Our algorithms do not 
require multicast, failure detectors, timeouts, acknowledgments, or reconfiguration 
procedures. The result of this work is the “corrected-gossip” paradigm [27]. 


Framework for Collective Algorithms Based on the corrected-gossip paradigm, 
we also developed a framework for failure-proof collective operations that generates 
online an independent spanning tree. Generation can be completed successfully 
even with an arbitrary number of active nodes and up to f online failures. Based 
on the system’s mean time between failures (MTBF), an appropriate value f 
could be chosen as the maximal number of faults supported by the algorithm. 
Compared to alternative methods for fault recovery, this approach allows a trivial 
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recovery procedure, provided that sufficient spanning trees are maintained during 
the application run. 


3.3 Corrected Tree Algorithms 


The reliability properties of all correction schemes of “corrected gossip” have been 
proven [27]. However, due to being probabilistic, the gossip algorithm used in 
the dissemination phase needs to send more messages than an optimal tree-based 
broadcast algorithm in order to color nodes. Tree-based dissemination will, however, 
miss a large number of processes, if any process close to the tree’s root fails. In 
general, failure of any non-leaf process in the tree results in all its descendants 
remaining uncolored. 


Can We Save the Trees? We developed a corrected tree algorithm which splits 
communication among processes into two phases, exactly like we did with corrected 
gossip: dissemination and correction. The idea is to correct the result of a failed tree- 
based broadcast during the dissemination phase. With failure-proof correction, full 
coloring can be guaranteed even if processes fail during the broadcast [27]. 


Requirements for Corrected Trees The goal of the broadcast is to guarantee 
information propagation from the root process to every live process, even if some 
processes fail. In a broadcast operation among P processes, the root process 
propagates a message reliably to all other processes. Without loss of generality we 
assume the root process to have rank 0, other processes have ranks 1,..., P — 1. 
To ensure that correction can color uncolored processes quickly, the maximum gap 
size ought to be small. A tree maintaining such a property should have its subtrees 
spread across the correction ring to avoid the danger of having uncolored processes 
cluster together on the ring. For lowering correction latency, multiple small gaps are 
better than few large ones. 


Interleaved Trees The key idea behind corrected tree algorithms is that nodes of 
the tree can be renumbered in such a way that parent and child nodes do not become 
neighbors on the ring. Instead, nodes from a subtree below a failed node shall always 
have a close neighbor from outside their own subtree, so the they can be colored in 
the correction phase. In this paper, we only describe how to interleave k-ary trees. 
However, the scheme is also available for Lamé trees, which include Binomial trees. 
A more detailed description is given in the original publication [31]. 


Interleaving k-ary Trees Given the root process at level 0, a full k-ary tree has 
k! processes at level £, and k subtrees that have their root process at that level. The 
processes in these subtrees can have a distance of kf in the ring. Process r has the 
child processes r’: 


{r |r’ =rti-k’,0<i<k,r' <P} 
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This interleaving ensures that a failing process on level £ leads to every k-th process 
being uncolored. Thus, f = k — 1 failures on level £ or below can be tolerated, and 
still every k-th process will be colored after the dissemination. 


Correction Phase The correction phase follows the dissemination phase and 
is independent of the tree type. It ensures coloring of the processes that the 
dissemination phase left uncolored due to failures. All three correction algorithms 
developed for corrected gossip [27] are directly applicable. 


Simulation Results In our simulation-based evaluation, we used the same fail-stop 
fault model as in the corrected-gossip publication [27]. During broadcast, every 
process is either dead or alive. A process either sends all messages required by 
the protocol or none at all, but failures can occur anywhere outside of the broadcast 
operation. The simulation is based on the LogP-model [10]. The graph in Fig. 13 
shows the number of messages sent for different numbers of faults. All tree-based 
broadcasts send fewer messages than corrected gossip and are thus more efficient 
with regard to this metric. 


Latency Measurements To measure the average latency of a broadcast, we 
implemented all algorithms using MPI and ran them on the Piz Daint supercomputer 
installed at ETH Zurich. The results shown in Fig. 14 therefore include overhead 
due to the physical properties of the interconnect. Note that there are no faults 
in this experiment, so we can compare our implementations with the broadcast 
implementation from Cray. Since our MPI-based implementation cannot use the 
shared-memory optimization the Cray algorithm uses, we include performance 
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Fig. 13 Average number of messages sent per process in presence of failures; four different 


corrected tree algorithms and corrected gossip were simulated with varying percentage of failed 
processes 
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Fig. 14 Broadcast median latency of corrected tree, corrected gossip, and vendor-provided 
implementations of broadcast without any failures 


of the Cray algorithm with shared memory disabled for reference (green line in 
diagram). 


We find Corrected Trees to be a simple, yet efficient protocol for fault-tolerant 
collective group communication. A more detailed evaluation can be found in the 
publication [31]. 


3.4 Checkpointing Scheduling 


Only some applications can benefit from forward recovery by tolerating process 
or node failures. All other HPC codes rely on system-level or application-assisted 
fault-tolerance services. Periodic checkpoint/restart (C/R) [11, 65] is an effective 
mechanism to alleviate failures during execution of HPC applications, which often 
require a vast number of nodes for a relatively long time [15]. The overhead of 
checkpointing should be kept as small as possible, as typically the computation has 
to be paused during a checkpoint write to store a consistent state of the application. 
Furthermore, on a large supercomputer, which is shared by several parallel jobs, 
checkpointing should be coordinated between applications to avoid performance 
bottlenecks. Otherwise, as the storage is typically shared between all applications, 
concurrent checkpoint writes would unnecessarily delay the computation tasks. 


Uncoordinated Checkpointing Hurts As part of the FFMK project, we designed 
a mechanism to coordinate concurrent checkpoints of large applications running on 
a supercomputer and sharing a dedicated burst-buffer [42] or parallel file system 
(PFS) [41] for checkpointing [22]. We assume that the system executes multiple 
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parallel applications at the same time and that it provides a single shared PFS 
(e.g., Lustre, GPFS, ...) for all applications. Typically, each application uses fixed 
checkpoint intervals calculated independently of the other applications based on 
Daly’s method [11]. However, Daly assumes a constant checkpoint write duration 
during the application’s lifetime, whereas in case of concurrent checkpointing of 
different applications checkpoint costs vary and depend on other applications’ 
activities due to limited write bandwidth and other interferences. 


Checkpoint Scheduling Approach In our work, we take this variation into 
account and rearrange the checkpoints of jobs, by predicting conflicts, in order 
to minimize the overall system resource usage. This is achieved by scheduling 
checkpoints earlier or later than originally planned. The decision whether to 
postpone a checkpoint to another timeslot is based on a constructed optimization 
problem for the predicted conflicts. By minimizing the constructed problem, non- 
overlapping timeslots to write checkpoints are found and applied. 


Checkpoint Scheduling Architecture While system-level checkpointing takes 
advantage of system-wide information not visible to jobs such as mean-time to 
failure (MTTF), bandwidth, or other jobs’ activities, it suffers from unnecessarily 
large checkpoints (higher checkpoint costs). On the other hand, user-defined 
checkpointing is done within the application which uses a runtime library for 
checkpointing. The library remembers the important data to be saved and also knows 
at which time checkpoints should be written (application-level information). This 
way the approach reduces the checkpoint size while lacking the system-wide infor- 
mation. In our work, we combined system-level with user-defined checkpointing 
in order to leverage the advantages of both methods [22]. The coordination among 
different jobs is performed by a coordinator service running on the system side. The 
service has access to system-wide information like MTTF, the bandwidth of shared 
storage systems, and the activities of other jobs (system-level checkpointing). 


User-Level Library Support To achieve this, we modified the SCR library [44] 
to provide it with the ability to interact with our coordinator service, which 
instructs SCR when to perform checkpoint requests. The application uses the library 
to write checkpoints (user-level checkpointing). This library cooperates with the 
coordinator-service, whereby whenever a checkpoint call is made by the application 
(which is done frequently and periodically), a request is sent to the coordinator and 
if accepted, the write will be performed by the library. Otherwise, the checkpoint 
call will be ignored and the application continues with the computation. 


System-Level Coordinator The decisions on the coordinator side are based on 
solved optimization problems constructed for the predicted conflicts. Then the 
incoming checkpoint requests are either accepted or rejected based on the computed 
decisions to get optimal checkpoint times. 


Figure 15 shows that we are able to reduce the C/R overhead by up to 20% by 
coordinating checkpoints compared to state-of-the-art approaches. 
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Fig. 15 Improvement rate of C/R overhead 


3.5 Multi-Level Checkpoint/Restart 


To further strengthen the C/R mechanism of the FFMK OS platform, multi-level 
checkpoint/restart is used as a solution to address a wide range of failures of 
different severity occurring in large supercomputers [21]. Severe failures require 
application state to be recovered from the most stable storage devices. However, 
this comes at high cost, as stable storage devices are slower to write to than transient 
storage. This is the key motivation of using multi-level checkpoint/restart. 


Component Failure Rates To fully grasp the multi-level hierarchy of failures and 
storage devices, we constructed a comprehensive model of a typical large cluster 
containing nodes, local storages (ramdisk, SSD), network switches, power supplies, 
different shared storage systems (burst-buffer, PFS), etc. Additionally, we modeled 
5 different levels of checkpoints, all of which are supported by the SCR [44] library 
and most of them are also available on other checkpointing libraries as well (FTI [4], 
VeloC [46]): 


e Local Level: The first checkpoint level is the node’s local storage (ramdisk, SSD). 
While this level benefits from the lowest checkpointing cost among all levels, it 
cannot survive fatal faults where the node becomes unavailable. 
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e XOR Level: The next level is the XOR level, at which for each node a 
parity segment is computed and distributed among a set of nodes (XOR group) 
according to [24, 47]. Each node stores its local checkpoint along with the 
XOR parity data. This level survives a single node of an XOR group becoming 
unavailable. 

e Partner Level: A more stable level is the partner scheme, where the checkpoint 
is stored in the node’s local storage along with the partner node’s storage (two 
copies). This level fails to restart the job, if a node and its partner fail together. 

e Shared Levels: The most stable levels use shared storage (among all compute 
nodes), namely the burst buffer and the parallel file system. However, they provide 
lower checkpointing bandwidth per node, because many (or all) nodes may access 
them concurrently. 


Modeling Failures Per Component To have a proper estimation of the stability 
of different checkpoint levels (i.e., how often they fail), we studied a wide range 
of possible failures occurring on supercomputers and investigated their effect on 
each checkpoint level. Using the estimations, we defined the optimum checkpoint 
intervals for each level minimizing the application’s lifetime. To study different 
types of failures, we differentiated light faults and fatal failures, where after a light 
fault the node stays alive and the job can be restarted from the local storage device. 
A fatal failure makes a node or a set of nodes unavailable (node down). In this case, 
the job must be restarted from more stable levels (XOR, partner, etc.). Additionally, 
we took correlations into account by arranging the nodes of a cluster in different 
correlated groups (network, power supply, etc.), whereby the nodes belonging to the 
same correlated group are vulnerable to failures of the same origin making all nodes 
in the group unavailable at the same time (e.g., a network switch outage). Figure 16 
visualizes, how this correlation of nodes translates into a correlation graph. 


In addition to the compute nodes, we considered I/O nodes (burst-buffer) which 
are also vulnerable to failures making the burst-buffer unavailable for a restart. 
Although such faults do not interrupt the job’s computation, losing checkpoints 
on the burst-buffer may impact the job’s lifetime. Another case of a checkpoint’s 
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Fig. 16 Correlation of nodes in a cluster 
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unavailability occurs, if due to a node down, the job is restarted from the XOR level 
and the new node allocated to the job lacks the partner level checkpoint. Hence, in 
case of another failure, an attempt to restore from partner-level will fail to recover 
the just restarted job a second time until the next partner checkpoint is written. 


Model-Based Adaptation of Checkpoint Behavior Finally, we introduced a 
mechanism to model the expected lifetime of applications writing multi-level 
checkpoints on a given set of checkpoint levels considering all investigated types 
of failures. Using this model, we properly choose the optimal set of checkpoint 
levels for each job minimizing the expected lifetime. 


Multi-Level Checkpoint Architecture Our implementation of multi-level check- 
point/restart consists of a system-level daemon and a library that is linked into 
the application. The daemon communicates with the library in the application 
using TCP sockets and directs the local checkpoints (written by the application) to 
different checkpoint levels transparent to the application. It computes the optimum 
checkpoint intervals of each level using system-wide information and application- 
specific parameters. For each checkpoint call made through the library, the daemon 
determines the optimal level of the checkpoint (or simply rejects it). We employ 
the following optimizations and automations in our design to further improve 
performance and reliability: 


¢ Asynchronous Checkpointing: To further reduce the checkpointing costs of the 
lower levels, checkpoint writes are performed asynchronously. The application 
writes its state to the node’s local storage, providing fast write performance. 
Thereafter, the operation of checkpointing at the lower levels (e.g., partner, burst 
buffer, PFS) is performed by the daemon in the background while the application 
returns to the computation. During the background operation of the daemon, 
every incoming request for checkpointing to the same level from the job is ignored 
until the persisting is finished. 

e Fast Recovery: To ensure the fast recovery in case of light failures (node alive), 
ULFM [7] is used to provide the capability of automatically recovering the failed 
ranks with the most recent checkpoint and reordering the ranks to the original 
arrangement, transparent to the developer, application, and the resource manager. 

e Job Life Cycle and Recovery When a new job is started, a short-lived controller 
service is executed by the first rank on each node (i.e., the node’s local rank 0). 
This service forks a daemon process for each rank on the node, connects them 
to the corresponding ranks, and then terminates. The per-rank daemons are 
responsible for sending status information and receiving instructions. The per- 
rank daemons stay alive until the application is finished with the computation. If 
a rank fails, its daemon terminates and the rank is later restarted using ULFM. 
When a rank is restarted, or when it notices that its daemon died, it will execute 
the previously mentioned controller service again, which will then reconnect a 
new daemon process with the rank. This is done transparently to the application 
and the existence of both daemons and controller services remains transparent to 
the user and the developer. 
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Rotating Partner Nodes: To further enhance the stability of the partner level, a 
novel approach will be engaged in which there are no fixed partners for the nodes, 
instead, the partner node of a specific node is changed on each turn and chosen by 
the daemon. This allows reducing the probability of unavailability of checkpoints 
at this level. In addition, partner checkpoints and data transfers will be performed 
by service daemons of both nodes using Remote Direct Memory Access (RDMA) 
in order to have fast partner checkpoints in the background. 


Our evaluations of the introduced multi-level checkpoint/restart model (the failure 
rates and optimal intervals) show that we managed to reduce the C/R overhead of 
jobs by up to 10% in the investigated cases (50,000 nodes, 3—5 levels, 6-240 GB 
checkpoints) compared to the state-of-the-art approach. Our observations indicate 
that the overhead is reduced further as the number of levels or the checkpoint size 
increases. 


4 Conclusions 


The FFMK project had the very ambitious goal of creating a new operating- 
system (OS) platform for Exascale computing. Unfortunately, we did not reach the 
milestone of a fully integrated prototype at the end of phase 2. However, several 
essential building blocks have been adapted for HPC and new algorithms and 
services have been developed to meet the challenges we expect from these upcoming 
extreme-scale systems. 


Multi-Kernel Node OS At the node level, the mature L4Re microkernel together 
with L*Linux enable the decoupled-thread model, which combines noise-free exe- 
cution and full Linux compatibility for applications. We studied and quantified the 
influence of hardware performance variation, a source of noise that still remained. 
Furthermore, we gave an outlook on how to manage OS-level resources at extreme 
scales of parallelism with a kernel architecture for scalable capability management. 


Dynamic Platform Management Although mostly an activity from the first three- 
year phase of the project, we continued research on dynamic platform management 
at the system level. We researched fault-tolerant gossip-based algorithms for 
obtaining load and health information from nodes. We integrated these information 
dissemination algorithms with a per-node management daemon and a distributed 
decision making module. However, one component of the FFMK architecture’s 
load balancing service remained elusive: migration of processes has been shown 
to work at small scale for simple InfiniBand-based programs, but robust C/R-based 
migration is not available yet. In Phase 2, we developed highly efficient diffusion- 
based load balancing algorithms; they complement the architecture and can already 
be used for application-level load balancing, as they have been integrated into a 
widely-used framework. 
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Fault Tolerance The main activities of the second phase of this project concen- 
trated on efficient and fault-tolerant communication, as well as system-level support 
for reducing checkpoint overhead. Fault-tolerant collective operations (FTCO) have 
been developed as a drop-in replacement for MPI libraries and were prototyped 
in the UCX communication library. As a potentially more efficient alternative, we 
introduced a new class of two-step algorithms based on either gossip or trees: 
Corrected Gossip and Corrected Trees enable inherently fault-tolerant collective 
operations for applications that can continue even after some processes failed. 
Globally optimized checkpoint scheduling and multi-level checkpointing minimize 
checkpoint/restart overhead for all other applications. The latter optimizes the 
cost of checkpointing based on a model that includes expected failure rates of 
components, a failure-correlation graph, and available write bandwidth across the 
storage hierarchy. 


Together, all newly developed algorithms, system services, and low-level OS 
components form the basis of a fast, scalable, and fault-tolerant HPC operating 
system for Exascale computing. 
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Abstract Atomistic simulations of large biomolecular systems with chemical 
variability such as constant pH dynamic protonation offer multiple challenges in 
high performance computing. One of them is the correct treatment of the involved 
electrostatics in an efficient and highly scalable way. Here we review and assess two 
of the main building blocks that will permit such simulations: (1) An electrostatics 
library based on the Fast Multipole Method (FMM) that treats local alternative 
charge distributions with minimal overhead, and (2) A A-dynamics module working 
in tandem with the FMM that enables various types of chemical transitions during 
the simulation. Our A-dynamics and FMM implementations do not rely on third- 
party libraries but are exclusively using C++ language features and they are 
tailored to the specific requirements of molecular dynamics simulation suites such 
as GROMACS. The FMM library supports fractional tree depths and allows for 
rigorous error control and automatic performance optimization at runtime. Near- 
optimal performance is achieved on various SIMD architectures and on GPUs 
using CUDA. For exascale systems, we expect our approach to outperform current 
implementations based on Particle Mesh Ewald (PME) electrostatics, because 
FMM avoids the communication bottlenecks caused by the parallel fast Fourier 
transformations needed for PME. 
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1 Introduction 


The majority of cellular function is carried out by biological nanomachines made 
of proteins. Ranging from transporters to enzymes, from motor to signalling 
proteins, conformational transitions are frequently at the core of protein function, 
which renders the detailed understanding of the involved dynamics indispensable. 
Experimentally, atomistic dynamics on submillisecond timescales are notoriously 
difficult to access, making computer simulations the method of choice. Molecular 
dynamics (MD) simulations of biomolecular systems are nowadays routinely used 
to study the mechanisms underlying biological function in atomic detail. Examples 
reach from membrane channels [28], microtubules [20], and whole ribosomes [4] to 
subcellular organelles [43]. Recently, the first MD simulation of an entire gene was 
reported, comprising about a billion of atoms [21]. 

Apart from system size, the scope of such simulations is limited by model 
accuracy and simulation length. Particularly the accurate treatment of electrostatic 
interactions is essential to properly describe a biomolecule’s functional motions. 
However, these interactions are numerically challenging for two reasons. 

First, their long-range character (the potential drops off slowly with 1% with 
distance r) renders traditional cut-off schemes prone to artifacts, such that grid- 
based Ewald summation methods were introduced to provide an accurate solution 
in 3D periodic boundaries. The current standard is the Particle Mesh Ewald (PME) 
method that makes use of fast Fourier transforms (FFTs) and scales as N - log N 
with the number of charges N [11]. However, when parallelizing PME over many 
compute nodes, the algorithm’s communication requirements become more limiting 
than the scaling with respect to N. Because of the involved FFTs, parallel PME 
requires multiple all-to-all communication steps per time step, in which the number 
of messages sent between p processes scales with p? [29]. For the PME algorithm 
included in the highly efficient, open source MD package GROMACS [42], much 
effort has been made to reduce as much as possible the all-to-all bottleneck, e.g. 
by partitioning the parallel computer in long-range and short-range processors, 
which reduces the number of messages involved in all-to-all communication [17]. 
Despite these efforts, however, even for multimillion atom MD systems on modern 
hardware, performance levels off beyond several thousand cores due to the inherent 
parallelization limitations of PME [30, 42, 45]. 

The second challenge is the tight and non-local coupling between the electrostatic 
potential and the location of charges on the protein, in particular titratable/protonat- 
able groups that adapt their total charge and potentially also their charge distribution 
to their current electrostatic environment. Hence, all protonation states are closely 
coupled, depend on pH, and therefore the protonation/deprotonation dynamics 
needs to be taken into account during the simulation. Whereas most MD simulations 
employ fixed protonation states for each titratable group, several dynamical schemes 
have been introduced [8, 13, 14, 23, 33, 37] that use a protonation coordinate à to 
distinguish the protonated from the deprotonated state. Here, we follow and expand 
the A-dynamics approach of Brooks et al. [27] and treat à as an additional degree 
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of freedom in the Hamiltonian with mass m,. Each protonatable group is associated 
with its own A “particle” that adopts continuous values in the interval [0, 1], where 
the end points around A = O and A = 1 correspond to the physical protonated 
or deprotonated states. A barrier potential with its maximum at A œ~ 0.5 serves 
two purposes. (1) It reduces the time spent in unphysical states, and (2) it allows 
to tune for optimal sampling of the A coordinate by adjusting its height [8, 9]. 
Current A-dynamics simulations with GROMACS are however limited to small 
system sizes with a small number n, of protonatable groups [7—9], as the existing, 
PME-based implementation (see www.mpibpc.mpg.de/grubmueller/constpH) needs 
an extra PME mesh evaluation per à group and suffers from the PME parallelization 
problem. While these extra PME evaluations can be overcome for the case where 
only the charges differ between the states, for the most general case of chemical 
alterations this is not possible. 

Without the PME parallelization limitations, a significantly higher number of 
compute nodes could be utilized, so that both larger and more realistic biomolecular 
systems would become accessible. The Fast Multipole Method [15] (FMM) is a 
method that by construction parallelizes much better than PME. Beyond that, the 
FMM can compute and communicate the additional multipole expansions that are 
required for the local charge alternatives of A groups with far less overhead as 
compared to the PME case. This makes the communicated volume (extra multipole 
components) somewhat larger, but no global communication steps are involved 
as in PME, where the global communication volume grows linearly with n} and 
quadratic with p. We also considered other methods that, like FMM, scale linearly 
with the number of charges, as e.g. multigrid methods. We decided in favor of 
FMM, because it showed better energy conservation and higher performance in a 
comparison study [2]. 

We will now introduce A-dynamics methods and related work to motivate the spe- 
cial requirements they have on the electrostatics solver. Then follows an overview of 
our FMM-based solver and the design decisions reflecting the specific needs of MD 
simulation. We will describe several of the algorithmical and hardware-exploiting 
features of the implementation such as error control, automatic performance tuning, 
the lightweight tasking engine, and the CUDA-based GPU implementation. 


2 Chemical Variability and Protonation Dynamics 


Classical MD simulations employ a Hamiltonian H that includes potential terms 
modeling the bonded interactions between pairs of atoms, the bond angle inter- 
actions between bonded atoms, and the van der Waals and Coulomb interactions 
between all pairs of atoms. For conventional, force field based MD simulations, the 
chemistry of molecules is fixed during a simulation because chemical changes are 
not described by established biomolecular force fields. Exceptions are alchemical 
transformations [36, 38, 46, 47], where the system is either driven from a state 
A described by Hamiltonian Ha to a slightly different state B (with Hp) via 
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a à parameter that increases linearly with time, or where A/B chimeric states 
are simulated at several fixed à values between A = 0 anda = 1, as e.g. in 
thermodynamic integration [24]. The A — B transition is described by a combined, 
A-dependent Hamiltonian 


Has A) = (1 — Ha + AMG. (1) 


In these simulations, which usually aim at determining the free energy difference 
between the A and B states, the value of À is an input parameter. 

In contrast, with A-dynamics [16, 25, 27], the à parameter is treated as an 
additional degree of freedom with mass m, whose 1D coordinate à and velocity À 
evolve dynamically during the simulation. Whereas in a normal MD simulation all 
protonation states are fixed, with A-dynamics, the pH value is fixed instead and the 
protonation state of a titratable group changes back and forth during the simulation 
in response to its local electrostatic environment [23, 39]. If two states (or forms) 
A and B are involved in the chemical transition, the corresponding Hamiltonian 
expands to 


HO) = (1— A)Hy + AH + 1/24? + Voias(A) (2) 


with a bias potential Vpjas that is calibrated to reflect the (experimentally determined) 
free energy difference between the A and B states and that optionally controls other 
properties relating to the A = B transitions [8]. With the potential energy part V of 
the Hamiltonian, the force acting on the A particle is 


f=- (3) 


If coupled to the protonated and deprotonated form of an amino acid side chain, 
e.g., A-dynamics enables dynamic protonation and deprotonation of this side chain 
in the simulation (see Fig. 1 for an example), accurately reacting to the electrostatic 
environment of the side chain. More generally, also alchemical transformations 
beyond protons are possible, as well as transformations involving more than just two 
forms A and B. Equation 2 shows the Hamiltonian for the simplest case of a single 
protonatable group with two forms A and B, but we have extended the framework to 
multiple protonatable groups using one A; parameter for each chemical form [7-9]. 


2.1 Variants of }-Dynamics and the Bias Potential 


The key aim of A-dynamics methods is to allow for dynamic protonation, but there 
are three areas in which the implementations differ from each other. These are the 
coordinate system used for A, the type of the applied bias potential, and how A is 
coupled to the alchemical transition. Before we discuss the different choices, let 
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Fig. 1 Simplified sketch of a protein (right, grey) in solution (blue) with several protonatable sites 
(ball-and-stick representations) of which a histidine (top left) and a carboxyl group (bottom left) 
are highlighted. The histidine site contains four forms (two neutral, two charged), whereas the 
carboxyl group contains three forms (two neutral, one negatively charged). In A-dynamics, the 
lambdas controls how much of each form is contributing to a site. Atom color coding: carbons- 
black, hydrogens/protons-white, oxygens-red, nitrogens-blue 


us define two terms used in the context of chemical variability and protonation. 
We use the term site for a part of a molecule that can interconvert between two or 
more chemically different states, e.g. the protonated and deprotonated forms of an 
aminoacid. Additionally, we call each of the chemically different states of a site a 
form. For instance, a protonatable group is a site with at least two forms A and B, a 
protonated form A and a deprotonated form B. 


2.1.1 The Coordinate System for A 


Based on the coordinate system in which A lives (or on the dynamical variables 
used to express à), we consider three variants of A-dynamics listed in Table 1. The 
linear variant is conceptually most straightforward, but it definitely needs a bias 
potential to constrain À to the interval [0..1]. The circular coordinate system for 
à used in the hypersphere variant automatically constrains the range of à values 
to the desired interval, however one needs to properly correct for the associated 
circle entropy [8]. The Nexp variant implicitly fulfils the constraints on the Nforms 
individual lambdas (Eq. 4) for sites that are allowed to transition between Nforms 
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Table 1 Three variants of A-dynamics are considered 


Variant name | Ref. | Dynamical variable | Geometric picture 


Linear [9] |à A lives on a constricted linear interval, e.g. [0..1] 
Hypersphere [8] | 6 A lives on a circle 
Brooks’ Nexp | [26] | No simple geometric interpretation 

S 

T 

g 

° 

zO 0 Noan] 

> A B 


0.0 0.5 1.0 0.0 0.5 1.0 


bias potential 
Q 
4 $ 
o 


0.0 0.5 1.0 0.0 05 1.0 
A A 


Fig. 2 Qualitative sketches of individual bias potentials (black) that fulfil some of the require- 
ments (1)-(5), and resulting equilibrium distributions of à values (green) 


different forms (Nforms = 2 in the case of simple protonation), such that no 
additional constraint solver for the 4; is needed. 


2.1.2 The Bias Potential 


The bias potential Vpias(A) that acts on à fulfils one or more of the following tasks. 


1. 


2. 
. It takes care that the resulting à values cluster around 0 or 1, suppressing values 


If needed, it limits the accessible values of À to the interval [0..1], whereas slight 
fluctuations outside that interval may be desirable (Fig. 2a). 
It cancels out any unwanted barrier at intermediate À values (b) 


between about 0.2 and 0.8 (c) 


. Itregulates the depth and width of the minima at 0 and 1, such that the resulting à 


distribution fits the experimental free energy difference between protonated and 
deprotonated form (c + d). 


. It allows to tune for optimal sampling of the à space by adjusting the barrier 


height at A = 0.5 (c) 


Taken together, the various contributions to the barrier potential might look like 


the example given in Fig. 3 for a particular A in a simulation. 
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Fig. 3 Qualitative sketch of a 
bias potential (black) that 
fulfils all requirements 
(1)+(5) with resulting 
equilibrium distribution of À 
values (green) 


bias potential 


2.1.3 How A Controls the Transition Between States 


The à parameter can either be coupled to the transition itself between two forms (as 
in [8, 9]), then A = 0 corresponds to form A and A = 1 to form B. Alternatively, 
each form gets assigned its own Ag with a € {A, B} as weight parameter. In the 
latter case one needs extra constraints on the weights similar to 


Yeah O< <l, (4) 


such that only one of the physical forms A or B is fully present at a time. For the 
examples mentioned so far, with just two forms, both approaches are equivalent and 
one would rather choose the first one, because it involves only one A and needs no 
extra constraints. 

If, however, a site can adopt more than two chemically different forms, the weight 
approach can become more convenient as it allows to treat sites with any number 
Nforms Of forms (using a number of Norms independent A parameters). Further, it 
does not require that the number of forms is a power of two (Norms = 2M2) as in 
the transition approach. 


2.2 Keeping the System Neutral with Buffer Sites 


In periodic boundary conditions as typically used in MD simulations, the electro- 
static energy is only defined for systems with a zero net charge. Therefore, if the 
charge of the MD system changes due to à mediated (de)protonation events, system 
neutrality has to be preserved. With PME, any net charge can be artificially removed 
by setting the respective Fourier mode’s coefficient to zero, so that also in these 
cases a value for the electrostatic energy can be computed. However, it is merely 
the energy of a similar system with a neutralizing background charge added. Severe 
simulation artifacts have been reported as side effects of this approach [19]. 

As an alternative, a charge buffer can be used that balances the net charge of the 
simulation system arising from fluctuating charge of the protonatable sites [9, 48]. 
A reduced number of npuffer buffer sites, each with a fractional charge |q| < le (e.g. 
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via H2O = H307*), was found to be sufficient to neutralize the Nites protonatable 
groups of a protein with nputter K Nites. The total charge of these buffer ions is 
coupled to the system’s net charge with a holonomic constraint [9]. The buffer sites 
should be placed sufficiently far from each other, such that their direct electrostatic 
interaction through the shielding solvent is negligible. 


3 A Modern FMM Implementation in C++ Tailored to MD 
Simulation 


High performance computing (HPC) biomolecular simulations differ from other 
scientific applications by their comparatively small particle numbers and by their 
extremely high iteration rates. With GROMACS, when approaching the scaling 
limit, the number of particles per CPU core typically lies in the order of a few 
hundred, whereas the wall-clock time required for computing one time step lies in 
the range of a millisecond or less [42]. In MD simulations with A-dynamics, the 
additional challenge arises to calculate the energy and forces from a Hamiltonian 
similar to Eq.2, but for N protonatable sites, in an efficient way. In addition to 
the Coulomb forces on the regular charged particles, the electrostatic solver has to 
compute the forces on the N À particles as well [8] via 


Ve OVC, .--, Ai-1, Ài, Ait, ÀN) 


fi = JA; 


= [VeA s Minas di = 1,Ai41,..-An) 
Vots -s Ain, Ai = 0, Aitas Aw) | (5) 


Accordingly, with A-dynamics, for each of the ;’s, the energies of the pure (i.e., 
A; = 0 and A; = 1) states have to be evaluated while keeping all other lambdas at 
their actual fractional values. 

The aforementioned requirements of biomolecular electrostatics have driven 
several design decisions in our C++ FMM, which is a completely new C++ 
reimplementation of the Fortran ScaFaCoS FMM [5]. Although several other FMM 
implementations exist [1, 50], none of them is prepared to compute the potential 
terms needed for biomolecular simulations with A-dynamics. 

Although our FMM is tailored for usage with GROMACS, it can be used as 
an electrostatics solver for other applications as well as it comes as a separate 
library in a distinct Git repository. On the GROMACS side we provide the necessary 
modifications such that FMM instead of PME can be chosen at run time. Apart 
from that, GROMACS calls our FMM library via an interface that can also be 
used by other codes. The development of this library follows three principles. 
First, the building blocks (i.e., data structures) used in the FMM support each level 
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of the hierarchical parallelism available on today’s hardware. Second, the library 
provides different implementations of the involved FMM operators depending on 
the underlying hardware. Third, the library optionally supports A-dynamics via an 
additional interface. 


3.1 The FMM ina Nutshell 


The FMM approximates and thereby speeds up the computation of the Coulomb 
potential Vc for a system of N charges: 


N 

Vo x >». idj (6) 
i ja lr; — rj] 

For that purpose, the FMM divides the simulation box into eight smaller boxes 
(depth d = 1), which are subsequently subdivided into eight smaller boxes again 
(d = 2) and again (d = 3, 4, . . .). The depth d refers to the number of subdivisions. 
On the lowermost level, i.e. for the smallest boxes (largest d), all interactions 
between neighboring boxes are directly calculated (these are called the near-field 
interactions). Interactions with boxes further away are approximated by a multipole 
expansion of order p (these are called the far-field interactions). A comprehensive 
description of the FMM algorithm is beyond the scope of this text, however we will 
shortly describe the basic workflow and the different operators used in the six FMM 
stages as these will be referred to in the following sections. For a detailed overview 
of the FMM, see [22]; for an introduction in our C++ FMM implementation see [12]. 


3.1.1 EMM Workflow 


The FMM algorithm consists of six different stages, five of them required for the 
farfield (FF) and one for the nearfield (NF) (Fig. 4). After setting up the FMM 
parameters tree depth (d) and multipole order (p), the following workflow is 
executed. 


1. P2M: Expand particles into spherical multipole moments œm up to order p on 
the lowest level for each box in the FMM tree. Multipole moments for particles in 
the same box can be summed into a multipole expansion representing the whole 
box. 

2. M2M: Translate the multipole expansion of each box to its parent box inside the 
tree. Again, multipole expansions with the same box center can be summed up. 
The translation up the tree is repeated until the root node is reached. 

3. M2L: Transform remote multipole moments @ım into local moments Hım for 
each box on every level. Only a limited number of interactions for each box on 
each level is performed to achieve linear scaling. 
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Fig. 4 The classical (sequential) FMM workflow consists of six stages. Only the nearfield (P2P) 
can be computed completely independent of all other stages. Each farfield stage (P2M, M2M, 
etc.) depends on the former stage and exhibits different amounts of parallelism. Especially the 
distribution of multipole and local moments in the tree provide limited parallelism in classical 
loop-based parallelization schemes 


4. L2L: Translate local moments Hım starting from the root node down towards the 
leaf nodes. Local moments within the same box are summed. 

5. L2P: After reaching the leaf nodes, the farfield contributions for the potentials 
ỌrFr, forces Fpp, and energy Epp are computed. 

6. P2P: Interactions between particles within each box and its direct neighbors are 
computed directly, resulting in the nearfield contributions for the potentials Pvp, 
forces Fyr, and energy Eyr. 


3.1.2 Features of Our FMM Implementation 


Our FMM implementation includes special algorithmical features and features that 
help to optimally exploit the underlying hardware. Algorithmical features are 


e Support for open and 1D, 2D and 3D periodic boundary conditions for cubic 
boxes. 

e Support for A-dynamics (Sect. 2). 

e Communication-avoiding algorithms for internode communication via MPI 
(Fig. 9). 

e Automatic tuning of FMM parameters d and p to provide automatic error control 
and runtime minimization [6] based on a user-provided energy error threshold 
AE (Fig. 10). 

e Adjustable tuning to reduce or avoid energy drift (Fig. 11). 


Hardware features include 


e A performance-portable SIMD layer (Sect. 3.2.1). 

e A light-weight, NUMA-aware task scheduler for CPU and GPU tasks 
(Sect. 3.2.2). 

e A GPU implementation based on CUDA (Sect. 3.4). 
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3.2 Utilizing Hierarchical Parallelism 
3.2.1 Intra-Core Parallelism 


A large fraction of today’s HPC peak performance stems from the increasing width 
of SIMD vector units. However, even modern compilers cannot generate fully vec- 
torized code unless the data structures and dependencies are very simple. Generic 
algorithms like FFTs or basic linear algebra can be accelerated by using third-party 
libraries and tools specifically tuned and optimized for a multitude of different 
hardware configurations. Unfortunately, the FMM data structures are not trivially 
vectorizable and require careful design. Therefore, we developed a performance- 
portable SIMD layer for non-standard data structures and dependencies in C++. 

Using only C++11 language features without third-party libraries allows to fine- 
tune the abstraction layer for the non-trivial data structures and achieve a better 
utilization. Compile-time loop-unrolling and tunable stacking are used to increase 
out-of-order execution and instruction-level parallelism. Such optimizations depend 
heavily on the targeted hardware and must not be part of the algorithmic layer of 
the code. Therefore, the SIMD layer serves as an abstraction layer that hides such 
hardware-specifics and that helps to increase code readability and maintainability. 
The requested SIMD width (1x, 2x, ..., 16x) and type (float, double) is selected 
at compile time. The overhead costs and performance results are shown in Fig. 5. 
The baseline plot (blue) shows the costs of the M2L operation (float) without any 
vectorization enabled. All other plots show the costs of the M2L operation (float) 
and 16-fold vectorization (AVX-512). Since the runtime of the M2L operation is 
limited by the loads of the M2L operator, we try to amortize these costs by utilizing 
multiple (2x ...6x) SIMDized multipole coefficient matrices together with a single 
operator via unrolling (stacking). As can be seen in Fig. 5, unrolling the multipole 
coefficient matrices 2x (red), we reach the minimal computation time and the 
expected 16-fold speedup. Additional unroll factors (3x ... 6x) will not improve 
performance due to register spilling. To reach optimal performance, it is required to 
reuse (cache) the M2L operator for around 300 (or more) of these steps. 


3.2.2 Intra-Node and Inter-Node Parallelism 


To overcome scaling bottlenecks of a pragma-based loop-level parallelization (see 
Fig. 4), our FMM employs a lightweight tasking framework purely based on C++. 
Being independent of other third-party tasking libraries and compiler extensions 
allows to utilize resources better, since algorithm-specific behavior and data-flow 
can be taken into account. Two distinct design features are a type-driven priority 
scheduler and a static dataflow dispatcher. The scheduler is capable of prioritizing 
tasks depending on their type at compile time. Hence, it is possible to prioritize 
vertical operations (like M2M and L2L) in the tree. This reduces the runtime 
twofold. First, it reduces the scheduling overhead at runtime by avoiding costly 
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Fig. 5 M2L operation benchmark for vectorized data structures with multipole order p = 10 on an 
Intel Xeon Phi 7250F CPU for a float type with 16x SIMD (AVX-512). The benchmarks shows the 
performance of different SIMD/unrolling combinations. E.g. the red curve (SIMD stacking 16 x 2) 
utilizes 16-fold vectorization together with twofold unrolling For a sufficient number (around 
300) of vectorized operations, a 16-fold improvement can be measured for the re-designed data 
structures 


| o O 
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Fig. 6 The data flow of the FMM still consists of six stages. However, synchronization now 
happens on a fine-grained level and not only after each full stage is completed. This allows to 
overlap parts that exhibit poor parallelization with parts that show a high degree of parallel code. 
The dependencies of such a data flow graph can be evaluated and even prioritized at compile time 


virtual function calls. Second, since the execution of the critical path is prioritized, 
the scheduler ensures that a sufficient amount of independent parallelism gets 
generated. The dataflow dispatcher defines the dependencies between tasks—a data 
flow graph—also at compile time (see Fig. 6). Together with loadbalancing and 
workstealing strategies, even a non-trivial FMM data flow can be executed. For 
compute-bound problems this design shows virtually no overhead. However, in MD 
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Fig. 7 Intranode FMM benchmark for 1000 particles, multipole order p = 1 and tree depth 
d = 3 on a 2x26-core Intel Xeon Platinum 8170 CPU. When using MCS locks, simultaneous 
multithreading and 50 threads, the overall improvement compared to the original implementation 
reaches >40%, translating into a reduction in runtime from 1.93 ms down to 1.14 ms 
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Fig. 8 Initial internode FMM benchmark for 1,000,000 particles, multipole order p = 3 and tree 
depth d = 5 with one MPI rank per compute node of the JURECA cluster 


we are interested in smaller particle systems with only a few hundred particles per 
compute node. Hence, we have to take even more hardware constraints into account. 
Performance penalties due to the memory hierarchy (NUMA) and costs to access 
memory in a shared fashion via locks introduce additional overhead. Therefore, 
we extended also our tasking framework with NUMA-aware memory allocations, 
workstealing and scalable Mellor-Crummey Scott (MCS) locks [35] to enhance the 
parallel scalability over many threads, as shown in Fig. 7. 

In the future, we will extend our tasking framework so that tasks can also be 
offloaded to local accelerators like GPUs, if available on the node. 

For the node-to-node communication via MPI the aforementioned concepts 
do not work well (see Fig. 8), since loadbalancing or workstealing would create 
large overheads due to a large amount of small messages. To avoid or reduce 
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Fig. 9 Left: Intranode FMM parallelization—efficiency of different threading implementations. 
Near field interaction of 114,537 particles in double precision on up to 28 cores on a single node 
with two 14-core Intel Xeon E5-2695 v3 CPUs. Single precision computation as well as other 
threading schemes (std::thread, boost::thread, OpenMP) showed similar excellent scaling behavior. 
The plot has been normalized to the maximum turbo mode frequency which varies with the number 
of active cores (3.3-2.8 GHz for scalar operation, 3.0-2.6 GHz for SIMD operation). Right: 
Internode parallelization—strong scaling efficiency of a communication avoiding, replication- 
based workload distribution scheme [10]. Near field interaction of 114,537 particles on up to 
65,536 Blue Gene/Q cores using replication factor c. In the initial replication phase, only c nodes 
within a group communicate. Afterwards, communication is restricted to all pairs of p/c groups. 
For 65,536 cores, i.e. only 1-2 particles per core initially, a maximum parallel efficiency of 84% 
(22 ms runtime) is reached for c = 64, and the maximal replication factor c = 256 yields an 
efficiency of 73%, while a classical particle distribution (c = 1) would require a runtime exceeding 
1 min due to communication latency 


the latency that comes with each message, we employ a communication-avoiding 
parallelization scheme [10]. Nodes do not communicate separately with each other, 
but form groups in order to reduce the total number of messages. At the same time 
the message size can be increased. Depending on the total number of nodes involved, 
the group size parameter can be tuned for performance (see Fig. 9). 


3.3 Algorithmic Interface 


Choosing the optimal FMM parameters in terms of accuracy and performance 
is difficult if not impossible to do manually as they also depend on the charge 
distribution itself. A naive choice of tree depth d and multipole order p might either 
lead to wasting FLOPs or to results that are not accurate enough. Therefore, d and 
p are automatically tuned depending on the underlying hardware and on a provided 
energy tolerance AE (absolute or relative acceptable error in Coulombic energy). 
The corresponding parameter set {d, p} is computed such that the accuracy is met 
at minimal computational costs (Fig. 10) [6]. 

Besides tuning the accuracy to achieve a certain acceptable error in the Coulom- 
bic energy for each time step, the FMM can additionally be tuned to reduce the 
energy drift over time. 
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Fig. 10 Depending on a maximum relative or absolute energy tolerance AE, the automatic 
runtime minimization provides the optimal set of FMM input parameters {d, p}. A lower requested 
error in energy results in an increased multipole order p (magenta). Since the computational 
complexity of the farfield operators M2M, M2L and L2L scales with p? or even p* (depending on 
the used implementation), the tree depth d is reduced accordingly to achieve a minimal runtime 
(green). With fractional depths [49], as used here, the runtime can be optimized even more than 
with integer depths 


Whereas multipole orders of about ten yield a comparable drift of the total energy 
over time as a typical simulation with PME, the drift with FMM can be reduced to 
much lower levels if desired (Fig. 11). 


3.4 CUDA Implementation of the FMM for GPUs 


A growing number of HPC clusters incorporate accelerators like GPUs to deliver 
a large part of the FLOPS. Also GROMACS evolves towards offloading more and 
more tasks to the GPU, for reasons of both performance and cost-efficiency [31, 32]. 

For system sizes that are typical for biomolecular simulations, FMM perfor- 
mance critically depends on the M2L and P2P operators. For multipole orders of 
about eight and larger their execution times dominate the overall FMM runtime 
(Fig. 12). 

Hence, these operators need to be parallelized very efficiently on the GPU. At the 
same time, all remaining operators need to be implemented on the GPU as well to 
avoid memory traffic between device (GPU) and host (CPU) that would otherwise 
become necessary. This traffic would introduce a substantial overhead as a complete 
MD time step may take just a few milliseconds to execute. 

Our encapsulated GPU FMM implementation takes particle positions and 
charges as input and returns the electrostatic forces on the particles as output. 
Memory transfers between host and device are performed only at these two points 
in the calculation step. 
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Fig. 11 Observed drift of the total energy for different electrostatics settings. Left: evolution of 
the total energy for PME with order 4, mesh distance 0.113 nm, ewald-rtol set to 1075 (black 
line) as well as for FMM with different multipole orders p at depth d = 3 (see legend in the right 
panel). Test system is a double precision simulation at T ~ 300 K in periodic boundaries of 40 Nat 
and 40 Cl~ ions solvated in a 4.07 nm? box containing extended simple point charge (SPC/E) water 
molecules [3], comprising 6740 atoms altogether. Time step At = 2 fs, cutoffs at 0.9 nm, pair-list 
updated every ten steps. Right: Black squares show the drift with PME for different Verlet buffer 
sizes for the water/ions system using 4 x 4 cluster pair lists [41]. For comparison, green line shows 
the same for pure SPC/E water (without ions) taken from Ref. [34]. Influence of different multipole 
orders p on the drift is shown for a fixed buffer size of 8.3 A. The GROMACS default Verlet buffer 
settings yield a drift of ~ 8 x 1075 kJ/mol/ps per atom for these MD systems, corresponding to the 
first data point on the left (black square/green circle) 


The particle positions and charges are split into different CUDA streams that 
allow for asynchronous memory transfer to the host. The memory transfer is 
overlapped with the computation of the spatial affiliation of the octree box. 

In contrast to the CPU FMM that utilizes O( p?) far field operators (M2M, M2L, 
L2L), the GPU version is based on the O( p*) operator variant. The O( p?) operators 
require less multiplications to calculate the result, but they introduce additional 
highly irregular data structures to rotate the moments. Since the performance of 
the GPU FMM at small multipole orders is not limited by the number of floating 
point operations (Fig. 12) but rather by scattered memory access patterns, we use 
the O(p*) operators for the GPU implementation. 

We will now outline our CUDA implementation of the operations needed in the 
various stages of the FMM (Figs. 4, 5, and 6), which starts by building the multipoles 
on the lowest level with the P2M operator. 
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Fig. 12 Colored bars show detailed timings for the various parts of a single FMM step on a GTX 
1080Ti GPU for a 103,000 particle system using depth d = 3. For comparison, total execution 
time for d = 3 on an RTX 2080 GPU is shown as brown line, whereas black line shows timings 
for d = 4 on a GTX 1080Ti GPU. CUDA parallelization is used in each FMM stage leaving the 
CPU mostly idle 


3.4.1 P2M: Particle to Multipole 


The P2M operation is described in detail elsewhere [44]. The large number of 
registers that is required and the recursive nature of this stage limits the efficient 
GPU parallelization. The operation is however executed independently for each 
particle and the requested multipole expansion is gained by summing atomically 
into common expansion points. The result is precomputed locally using shared 
memory or intra-warp communication to reduce the global memory traffic when 
storing the multipole moments. The multipole moments w, local moments jz and 
the far field operators A, M, and C are stored as triangular shaped matrices 


w, U, A,Ce K?P*? v= { (X1m)1=0.....p. m=—I,...,l | Xim € C } (7) 


and M € K??*??, where p is the multipole order. 

To map the triangular matrices efficiently to contiguous memory, their elements 
are stored as 1D arrays of complex values and the /, m indices are calculated on the 
fly when accessing the data. For optimal performance, different stages of the FMM 
require different memory access patterns. Therefore, the data structures are stored 
redundantly in a Structure of Arrays (SoA) and Array of Structures (AoS) version. 
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The P2M operator writes to AoS, whereas the far field operators use SoA. A copy 
kernel, negligible in runtime, does the copying from one structure to another. 


3.4.2 M2M: Multipole to Multipole 


The M2M operation, which shifts the multipole expansions of the child boxes to 
their parents, is executed on all boxes within the tree, except for the root node which 
has no parent box. The complexity of this operation is O(p*); one M2M operation 
has the form 


L j 
omla) =Y X ojla)A-jm-la — a"), (8) 


j=0 k=-j 


where A is the M2M operator and a and a’ are different expansion center vectors. 
The operation performs O(p*) dot products between w and a part of the operator 
A. These operations need to be executed in all boxes in the octree, excluding the 
box on level 0, i.e. the root node. The kernels are executed level wise on each 
depth, synchronizing between each level. Each computation of the target wim for 
a distinct (l, m) pair is performed in a different CUDA block of the kernel, with 
threads within a block accessing different boxes sharing the same operator. The 
operator can be efficiently preloaded into CUDA shared memory and is accessed 
for different wım residing in different octree boxes. Each single reduction step 
is performed sequentially by each thread. This has the advantage that the partial 
products are stored locally in registers, reducing the global memory traffic since only 
O(p?) elements are written to global memory. It also reduces the atomic accesses, 
since the results from eight distinct multipoles are written into one common target 
multipole. 


3.4.3 M2L: Multipole to Local 


The M2L operator works similarly to M2M, but it requires much more transforma- 
tions as each source w is transformed to 189 target u boxes. The group of boxes 
to which a particular w is transformed to is called the interaction set. It contains 
all child boxes of the direct neighbor boxes of the source’s w parent. The M2L 
operation is defined as 


p j 
Um) =X J @jx(@)Mi+jm+e(a — 1), (9) 


j=0k=-j 


where r and a are different expansion centers. The operation differs only slightly 
from M2M in the access pattern but is of the same O(p*) complexity. As the 


GROMEXxX: A Versatile FMM for Biomolecular Simulation 535 


M2L runtime is crucial for the overall FMM performance, we have implemented 
several parallelization schemes. Which scheme is the fastest depends on tree depth 
and multipole order. The most efficient implementation is based on presorted lists 
containing interaction box pointers. The lists are presorted so that the symmetry 
of the operator M can be exploited. In M, the orthogonal operator elements differ 
only by their sign. Harnessing this minimizes the number of multiplications and 
global memory accesses and allows to reduce the number of spawned CUDA blocks 
from 189 to 54. However, it introduces additional overhead in logic to change 
signs and computations of additional target u box positions, so the performance 
speedup is smaller than 189/54. The kernel is spawned similarly to the M2M 
kernel performing one dot product per CUDA block preloading the operator M 
into shared memory. The sign changing is done with the help of and additional 
bitset provided for each operator. Three different parallelization approaches are 
compared in Fig. 13. Considering the hardware performance bottlenecks of this 
stage, the limitations highly differ for particular implementations. The naive M2L 
kernel is clearly bandwidth limited and achieves nearly 500 GB/s for multipole 
orders larger than ten. This is higher than the theoretical memory throughput of 
the tested GPU, which is 480 GB/s, due to caching effects. The cache utilization 
is nearly at 100% achieving 3500 GB/s. However, the performance of this kernel 
can be enhanced further by moving towards more compute bound regime. With the 
dynamical approach the performance is mainly limited by the costs of spawning 
additional kernels. It can be clearly seen with the flat curve shape for multipoles 
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Fig. 13 Comparison of three different parallelization schemes for the M2L operator, which is the 
most compute intensive part of the FMM algorithm. The naive implementation (red) directly maps 
the operator loops to CUDA blocks. It beats the other schemes only for orders p < 2. Dynamic 
parallelization (blue) is a CUDA specific approach that dynamically spawns thread groups from 
the kernels. The symmetric scheme (magenta) represents the FMM tree via presorted interaction 
lists. It also exploits the symmetry of the M2L operator 
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Fig. 14 Hardware utilization of the symmetrical M2L kernel of the GPU-FMM 


smaller than 13 in Fig. 13. The hardware utilization for the symmetrical kernel 
is depicted in Fig. 14. The performance of this kernel depends on the multipole 
order p, since p* is a CUDA gridsize parameter [40]. The values p < 7 lead 
to underutilization of the underlying hardware, however they are mostly not of 
practical relevance. For larger values the performance is operations bound achieving 
about 80% of the possible compute utilization. 


3.4.4 L2L: Local to Local 


The L2L operation is executed for each box in the octree, shifting the local moments 
from the root of the tree down to the leaves, opposite in direction to M2M. Although 
the implementation is nearly identical, it achieves slightly better performance than 
M2M because the number of atomic memory accesses is reduced due to the tree 
traversing direction. For the L2L operator, the result is written into eight target 
boxes, whereas M2M gathers information from eight source boxes into one. 


3.4.5 L2P: Local to Particles 


The calculation of the potentials at particle positions x; requires evaluating 


P I 
(x1) = Y O Mimin i= 0,..., Nox, (10) 


1=0 m=—1 
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where Oh n is a chargeless multipole moment of particle at position x; and Npox the 
number of particles in the box. The complexity of each operation is O(p?). This 
stage is similar to P2M since the chargeless moments need to be evaluated for each 
particle using the same routine for a charge of g = 1. The performance is limited by 
register requirement but like in the P2M stage it runs concurrently for each particle 
and it is overlapped with the asynchronous memory transfer from device to host. 


3.4.6 P2P: Particle to Particle 


The FMM computes direct Coulomb interactions only for particles in the leaves 
of the octree and between particles in boxes that are direct neighbors. These 
interactions can be computed for each pair of atoms directly by starting one thread 
for each target particle in the box that sequentially loops over all source particles. An 
alternative way that better fits the GPU hardware is to compute these interactions for 
pairs of clusters of size M and N particles, with M x N = 32 the CUDA warp size, 
as laid out in [41]. The forces acting on the sources and on the targets are calculated 
simultaneously. The interactions are computed in parallel between all needed box- 
box pairs in the octree. The resulting speedup of computing all atomic interactions 
between pairs of clusters instead of using simpler, but longer loops over pairs of 
atoms is shown in Fig. 15. The P2P kernels are clearly compute bound. The exact 
performance evaluation of the kernel can be found in [41]. 
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Fig. 15 Speedup of calculating the P2P direct interactions in chunks of M x N = 32 (i.e. for 
cluster pairs of size M and N) compared to computing them for all atomic pairs (i.e. for “clusters” 
of size M = N = 1). All needed FMM box-box interactions are taken into account 
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3.5 GPU FMM with X-Dynamics Support 


In addition to the regular Coulomb interactions, with A-dynamics, extra energy 
terms for all forms of all A sites need to be evaluated such that the forces on 
the A particles can be derived. The resulting additional operations exhibit a very 
unstructured pattern that varies depending on the distribution of the particles 
associated with À sites. Such a pattern can be described by multiple sparse FMM 
octrees that additionally interact with each other. The sparsity that emerges from a 
relatively small size of the à sites necessitates a different parallelization than for 
a standard FMM. To support A-dynamics efficiently, all stages of the algorithm 
were adapted. Especially, the most compute intense shifting (M2M, L2L) and 
transformation (M2L) operations need a different parallelization than that of the 
normal FMM to run efficiently for a sparse octree. Figure 16 shows the runtime of 
the CUDA parallelized -FMM as a function of the system size, whereas Fig. 17 
shows the overhead associated with A-dynamics. The overhead that emerges from 
addition of à sites to the simulation system scales linearly with the number of 
additional sites with a factor of about 107° per site. This shows that the FMM 
tree structure fits particularly well the A-dynamics requirements for flexibility to 
compute the highly unstructured, additional particle-particle interactions. Note that 
our A-FMM kernels still have the potential for more optimizations (at the moment 
they achieve only about 60% of the efficiency of the regular FMM kernels) such that 
for the final optimized implementation we expect the costs for the additional sites 
to be even smaller than what is shown in Fig. 17. 
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Fig. 16 Absolute runtime of the A-FMM CUDA implementation. For this example we use one A 
site per 4000 particles as estimated from the hen egg white lysozyme model system for constant- 
pH simulation. Each form of a À site contains ten particles. The tests were run on a GTX 1080Ti 
GPU 
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Fig. 17 As Fig. 16, but now showing relative costs of adding A-dynamics functionality to the 
regular GPU FMM 


4 Conclusions and Outlook 


All-atom, explicit solvent biomolecular simulations with A-dynamics are still 
limited to comparatively small simulation systems (<100,000 particles) and/or short 
timescales [7, 9, 18]. To ultimately allow for a realistic (e.g., const-pH) treatment 
of large biomolecular systems on long timescales, we are developing an efficient 
FMM that computes the long-range Coulomb interactions, including local charge 
alternatives for a large number of sites, with just a small overhead compared to the 
case without A-dynamics. 

Our FMM library is a modern C++11 based implementation tailored towards the 
specific requirements of biomolecular simulation, which are a comparatively small 
number of particles per compute core and a very short wall clock time per iteration. 
The presented implementation offers near-optimal performance on various SIMD 
architectures, an efficient CUDA version for GPUs, and it makes use of fractional 
tree depths for optimal performance. In addition to supporting chemical variability 
via A-dynamics, it has several more unique features such as a rigorous error control, 
and based upon that, an automatic performance optimization at runtime. The energy 
drift resulting from errors in the FMM calculation can be reduced to virtually zero 
with a newly developed scheme that adapts the multipole expansion order p locally 
and on the fly in response to the requested maximum energy error. With fixed p, 
using multipole orders 10-14 yields drifts that are smaller than those observed for 
typical simulations with PME. We expect the FMM to be useful also for normal 
MD simulations, as a drop-in PME replacement for extreme scaling scenarios where 
PME reaches its scaling limit. 

The GPU version of our FMM will implicitly use the same parallelization 
framework as the CPU version. In fact, GPUs will be treated as one of several 
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resources a node offers (in addition to CPUs), to which tasks can be scheduled. 
As our GPU implementation is not a monolithic module, it can be used to calculate 
individual parts of the FMM, like the near-field contribution or the M2L operations 
of one of the local boxes only, in a fine-grained manner. How much work is offloaded 
to local GPUs will depend on the node specifications and on how much GPU and 
CPU processing power is available. 

The A-dynamics module allows to choose between three different variants of 
A-dynamics. The dynamics and equilibrium distributions of the lambdas can be 
flexibly tuned by a barrier potential, whereas buffer sites ensure system neutrality 
in periodic boundary conditions. Compared to a regular FMM calculation without 
local charge alternatives, the GPU-FMM with A-dynamics is only a factor of two 
slower even for a large (500,000 atom) simulation system with more than 100 
protonatable sites. 

Although some infrastructure that is needed for out-of-the-box constant-pH 
simulations in GROMACS still has to be implemented, with the A-dynamics and 
FMM modules, the most important building blocks are in place and performing 
well. The next steps will be to carry out realistic tests with the new A-dynamics 
implementation and to thoroughly compare to known results from older studies, 
before advancing to larger, more complex simulation systems that have become 
feasible now. 
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Abstract In recent years the increasing compute power is mainly provided by 
rapidly increasing concurrency. Therefore, the HPC community is looking for 
new parallel programming paradigms to make the best use of current and up- 
coming machines. Under the Japanese CREST funding program, the post-petascale 
HPC project developed the XcalableMP programming paradigm, a pragma-based 
partitioned global address space (PGAS) approach. To better exploit the potential 
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concurrency of large scale systems, the mSPMD model was proposed and imple- 
mented with the YvetteML workflow description language. When introducing a new 
parallel programming paradigm, good tool support for debugging and performance 
analysis is crucial for the productivity and therefore the acceptance in the HPC 
community. The subject of the MYX project is to investigate which properties of a 
parallel programming language specification may help tools to highlight correctness 
and performance issues or help to avoid common issues in parallel programming in 
the first place. In this paper, we exercise these investigations on the example of 
XcalableMP and YvetteML. 


1 Introduction 


Exascale systems are expected to consist of tens of thousands of compute nodes, 
complemented by specialized accelerators, resulting in system architectures which 
are heterogeneous on multiple levels. Such architectures challenge the programmer 
to write multi-level parallel programs, which means employing multiple different 
paradigms to address each level of parallelism in the system [2]. This ranges 
from inter-node parallelism in the form of distributed memory parallelism, over 
shared-memory parallelism to exploit multi-core processors and acceleration units, 
to vector-style parallelism to target corresponding hardware units. The long-term 
challenge is to evolve existing and develop new programming models to better sup- 
port the application development on exascale machines. For different domains and 
different abstraction levels, various programming models have gained momentum. 
While there is ongoing research on how to make the currently predominant HPC 
programming model—namely MPI+X—scale well on such systems, the emerging 
and more high-level PGAS programming models have shown to deliver high 
productivity for users and certain types of codes [10]. The JST-CREST funded post- 
petascale HPC project developed the XcalableMP (XMP) programming paradigm, 
which combines local and global view PGAS concepts. 

The multi-level programming paradigm FP3C [13] as described later in this paper 
is a solution for post-petascale systems targeting a huge number of processors and 
the attached acceleration devices. Programmers can express high-level parallelism 
in the YvetteML (YML) workflow language and employ parallel components 
written in SPMD programming paradigms like XMP or MPI. Since YML drives and 
executes multiple SPMD tasks at the same time, this is characterized as mSPMD. 

The MYX project aims to combine the know-how and lessons learned of different 
areas to derive the input necessary to guide the development of future programming 
models and software engineering methods. Therefore we are developing correctness 
checking techniques for the XMP programming paradigm and make this analysis 
also available for the multi-level programming paradigm FP3C. 
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The contributions of this work are: 


e Identify possible correctness issues of XMP applications, 

e define an XMP tools interface to provide runtime state and event information to 
runtime tools, 

e extend MUST by XMP specific runtime correctness analyses, 

e extend YML to soundly support innovative numeric techniques like UCGLE, and 

e provide a workflow to analyze YML+XMP applications driven by the FC2P 
framework. 


The structure of the remaining paper is as follows. In Sect. 2 we introduce the 
concept of runtime correctness checking and provide an overview of the general 
implementation of MUST. In Sect.3 we provide a brief overview of the concepts 
of the XMP programming paradigm based on an example code. In Sect.4 we 
highlight potential correctness issues in XMP applications and what information 
is needed to analyze those errors. To perform such runtime analysis, a tool like 
MUST needs state and event information from the XMP runtime system. In Sect. 5 
we, therefore, provide a brief overview of the tools interface that we proposed as 
an extension of XMP to the XMP specification consortium. In Sect. 6 we introduce 
the concepts of the YML workflow language based on an example code. The unite 
and conquer method described in Sect.7 represents an example use case for an 
mSPMD program implemented with YML for the coarse-grained parallelism and 
XMP for the implementation of the individual YML tasks. To implement such a 
method, some extensions of YML are necessary, we also discuss the implications 
for correctness. In Sect.8 we present the FP2C framework, which provides a 
YML+XMP implementation targeted to HPC systems. As MPI is basically the 
standard for distributed memory HPC systems and those systems also prefer fixed- 
width jobs, i.e., jobs with a fixed number of processes the FP2C framework 
is implemented with MPI and dynamically launches MPI processes to fill the 
requested number of process slots. Finally, in Sect.9 we present the challenges 
and solutions to provide runtime correctness analysis in MUST for such a dynamic 
runtime system. 


2 Runtime Correctness Analysis for Parallel Programs 


Other than serial programs, parallel programs are affected by non-determinism as 
an effect of the concurrent execution of multiple threads or processes. For defect 
programs, this non-determinism can manifest as data races or deadlocks which are 
not known in serial programming. Different approaches to identify and remove the 
defects in those programs include static code analysis, model checking, and runtime 
or post-mortem analysis. Here we want to discuss runtime correctness analysis, 
where the error detection is performed during the execution of the program. 

MUST performs runtime correctness checking for MPI parallel applications. The 
application developer executes the application under the control of MUST, which 
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checks at execution time whether the usage of MPI is valid according to the MPI 
specification. For MPI applications we have shown, that the execution overhead 
for such runtime analysis is below 20% for the typical use case [8]. Although 
we aim at complete coverage of the MPI specification, the focus is currently on 
communication functions. 

The overhead of runtime correctness analysis depends significantly on the 
granularity of the analysis. For MPI, the granularity is quite coarse-grained: there 
is typically a lot of calculation between MPI function calls which is not analyzed. 
For data race detection in multithreaded applications, the granularity of analysis is 
much more fine-grained, as each individual memory access is subject to analysis. 
Therefore we see a two to hundredfold runtime overhead for data race analysis. 


2.1 Runtime Analysis in MUST 


For runtime analysis of distributed memory applications, we distinguish three kinds 
of analyses as shown in Fig. 1. Local analysis only needs information from a single 
application process and can be performed within the application process to avoid 
unnecessary data transfer. In a multi-threaded application, this analysis potentially 
needs information from multiple threads. We, therefore, spawn an additional 


Local analysis 

- opaque handle usage 

- local type matching 

- data race detection 

- buffer overlap detection 


Distributed analysis 

- distributed deadlock analysis 
- point-to-point matching 

- p2p type matching 


MUST |<] MUST 


O thread 
O process 
E application 


E MUST Centralized analysis 
- deadlock graph analysis 
- collective matching 


> 
> shared-memory - HTML output 
communication 


MPI communication MUST 


Fig. 1 MUST applied to a hybrid parallel application with four processes po... p3 and four 
threads tọ ...t3; each. MUST spawns an extra tool thread in each process, which communicates 
with the additional tool processes using a tree-based overlay network (TBON). Each analysis is 
performed on the first tool layer that has sufficient information to perform the specific analysis. 
In the typical setup, communication between MUST processes is performed using MPI. The 
communication between the threads uses shared memory communication 
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analysis thread to have all necessary information available. In a single-threaded 
application or when only the master thread communicates, the local analysis is 
performed in the application thread and no additional tool thread is spawned. As 
an example, local type matching compares compile-time information about types 
used by the application on variable declaration with the runtime information on the 
corresponding MPI data types used in communication [9]. 

Distributed analysis needs information from more than one application process. 
For scalability reasons, the analysis is distributed in the analysis tree and performed 
in the node where sufficient information is available. As an example, the distributed 
deadlock detection analysis runs on the parent node of each application process. It 
works with a distributed state transition system, which is fed with send and receive 
information for the specific process, as well as completion notification for collective 
communication [8]. 

Centralized analysis needs global information from multiple or all application 
processes. For scalability, a tree reduction analysis is applied where possible, so 
that the centralized analysis is just the last step in such reduction. An example of 
such tree reduction is collective matching analysis, where each tree node compares 
the parameters in collective communication for all child nodes and finally passes 
one representative to the parent tree node. An example of completely centralized 
analysis is graph-based deadlock analysis, which we use to visualize the circular 
dependencies causing a deadlock, but also to verify the presence of a deadlock 
in the time-out based deadlock detection. This analysis needs information on all 
pending communication operations but is only executed when a deadlock is detected 
or suspected. 


2.2 Underlying Tool Infrastructure of MUST 


MUST intercepts events in the execution of a targeted application to apply the 
analysis based on the information from these application events. Initially, these 
events were MPI function calls, but this is now extended to OpenMP and XMP 
events that are delivered to registered callback functions. Within an application 
process or thread, the tool can only get active when such an event is delivered. 
MUST builds on a tree-based overlay network (TBON) communication subsystem, 
as depicted in Fig. 1. Since the tool cannot make any assumptions, when it will 
be active on an application process or thread, those nodes communicate only 
towards the root of the tree. For use cases as point-to-point matching and distributed 
deadlock detection, the classical TBON communication scheme was extended by 
horizontal communication within a tool layer. In the current default configuration 
of MUST, all processes are started together as MPI processes with a common 
MPI COMM WORLD. Using the MPI interception layer PMPI, MUST then makes 
sure, that only the processes intended to execute the application code will continue 
execution. The tool processes remain in a run loop which waits for incoming 
events to process. Whenever application code uses MPI_COMM_ WORLD in an MPI 
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function call, this communicator is replaced by a sub-communicator representing 
the application processes. This is transparent for the application, which will never 
see the additional analysis processes. For performance reasons, the tree layout— 
including the number of layers and placement of analysis functions— is hard-coded 
and compiled into a specific instance of the tool. This means that with the current 
tool infrastructure a dynamic reconfiguration of the tree is not possible at runtime. 


3 XcalableMP 


XcalableMP (XMP) is a directive-based language extension for Fortran and C 
languages. Like in OpenMP, parallelism is introduced by the use of directives. If all 
the directives are ignored by the compiler, a serial program with the same semantics 
and results should remain. XMP targets parallel programming for distributed 
memory systems, in contrast to OpenMP targeting shared-memory parallelism. The 
implementation of XMP in the OmniCompiler is a source to source transformation, 
which translates the directives into additional code and calls into the XMP runtime 
library. The XMP runtime library communication is mainly performed using MPI. 
Therefore, it is in general also possible to use MPI communication in XMP 
programs or link a library written with XMP into an MPI application. 

Listing | shows an example of an XMP distributed parallel source code. This 
example assumes the execution with four processes which are assigned to the 
nodeset p. In XMP, the distribution of a virtual array onto nodes is defined as a 
template. An array is then associated with a template using the align statement. 
This defines the distribution of the array over the nodes. Also, the distributed 
processing is defined by applying the template to a loop directive. The iterations 
of the loop are executed on the different processes according to the distribution 
assigned to the template t. 

For parallelization of stencil codes on distributed memory systems, there is 
typically the need to use a halo as temporary copy for the calculation of the boundary 
in the local share. XMP supports such behavior with two directives. The shadow 


Listing 1 Global-view programming: distribute data and work to processes (nodes) 


#pragma xmp nodes p (4) 

#pragma xmp template t (0:11) 

#pragma xmp distribute t (block) onto p 

int B[12]; // Data Mapping 
#pragma xmp align B[i] with t(i) 


#pragma xmp loop (i) on t(i) 
for(i=0; i<12; i++) { // Work Mapping 
B[i]=B[i] *2; 


} 
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Listing 2 Local-view programming: use coarray notion in C code 


int a[10]:[*], b[10], e@[10]; 
#pragma xmp nodes p(4) 


int me = xmpc_this_image() ; 


int right = (me + 1) % 4, left = (me + 3) % 4; 
for(i=0; i<10; i++) { 

b[i]=me; // initialize 
} 
a[:]:[right] = b[:]; // put to right neighbor 
xmp_sync_all (NULL); // barrier and sync memory 
b[:] = al[:]; // local copy 
e[:] = a[:]: [left]; // get from left neighbor 


directive allows to specify the width of the halo for a specific distributed array and 
the reflect directive is to perform the update of the halo. 

The functionality described so far is called as global-view programming in XMP. 
In global-view programming, the application programmer does not need to care 
where data is located. The array is transparently distributed and accessed. In the 
suggested workflow, the work is performed where the data is located. 

Furthermore, XMP extends Coarray Fortran and makes this functionality also 
available in C. In XMP this is called local-view programming. To access memory 
on a different process in local-view programming means to explicitly specify the 
target process, that holds the image of interest. 

The code example in Listing 2 demonstrates how XMP allows using the concept 
of Coarray in C code. The array a is declared as a Coarray of size 10 with an image 
on each process. The image selector is separated with a colon in the declaration. A 
classical, local array b is initialized in the for loop and then assigned to the Coarray 
a. The slice notation b[:] similar to Fortran allows assigning a whole array at 
once. The assignment to the remote image right is semantically a put operation. 
Therefore, the slice notion is not only a convenience feature but allows to perform 
a single memory transfer in comparison to a for loop, which assigns each array 
element individually. 


4 Correctness Checking for XMP Programs 


In this section, we will discuss possible programming errors in XMP applications 
and how to detect those errors in the code. 


552 J. Protze et al. 
4.1 Programming Errors in XMP Programs 


For global-view programming we identified a range of possible programming errors 
that violate restrictions provided for the specific XMP construct. As an example, the 
barrier construct has the following restriction: 


e The nodeset specified by the on clause must be a subset of the executing nodeset. 


The code example in Listing 3 violates this restriction, because the task construct 
limits the executing nodeset to process p (0) , while the nodeset specified by the on 
clause is the complete nodeset p. The code presents also the MPI idiom with the 
same semantic. Since only the process with the rank number 0 reaches the barrier, 
this will finally result in a deadlock for the MPI code. 

Besides violations against restrictions imposed by the XMP specification, we also 
identified possible data races for asynchronous communication. The code example 
in Listing 4 initializes a distributed array, which is defined with a surrounding halo. 
The update of the halo is performed asynchronously, because of the async clause. 


Listing 3 Only a subset of processes participates in a collective barrier operation 


#pragma xmp task on p(0) 

{ 
printf ("Only executed on rank 0"); 
#pragma xmp barrier on p 

} 


if(rank == 0) 

{ 
printf ("Only executed on rank 0"); 
MPI_Barrier(MPI_COMM WORLD) ; 


} 


Listing 4 Asynchronously updating the halo can result in a data race 


int a[16]; 

#pragma xmp nodes p[4] 

#pragma xmp template t[16] 
#pragma xmp distribute t onto p 
#pragma xmp align a[i] with t[i] 
#pragma xmp shadow a[1] 


#pragma xmp loop (i) on t[i] 
for(int i=0;i1<16;1i++) 
ali] =i > 4; 


#pragma xmp reflect (a) width (/periodic/1) async(100) 
for(int i=0;i1<16;1i++) 

a[i] = a[i-1] + a[i+1]; 
#pragma xmp wait (100) 
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Before the asynchronous execution is finished, the stencil access already reads this 
halo value. It is unclear whether the old or the new value is read, but also whether 
the old or the new value is sent to the neighbor. 

Another possible error arises from the use of the orthogonal clause with the 
reflect construct. With this clause, only the orthogonally adjacent halo will be 
updated, but not the corners or edges of a multidimensional halo. For most stencil 
applications, this is sufficient and saves a lot of communication, because the corner 
is located on a different process. If the application nevertheless needs and accesses 
the value, it will see an uninitialized value there. 

For local-view programming, the main risk is data race in the remote memory 
access. This can occur if different processes access the same memory in the same 
image without synchronization and one of them modifies the memory. Revisiting 
the code example in Listing 2, we would have a data race on a if we remove the 
function call to xmp_sync_all. The left neighbor updates the local image a 
of a process, which would then not be synchronized with the local access to a and 
also not synchronized with the read by the right neighbor. 


4.2 Correctness Analysis for XMP Programs 


Since XMP programs translate to MPI programs in the implementation provided 
by the omni-compiler, we can apply native MPI correctness analysis to XMP 
programs. For an application which implements Listing 3, MUST detects a deadlock 
between an MPI Barrier implementing the XMP barrier directive and an 
MPI Barrier inserted by the XMP compiler at the end of the task region. 
Figure 2 shows the deadlock as reported by MUST. The left diagram depicts 
the cyclic dependency detected by MUST, where MPI_Barrier is called with 
two different communicators. The MUST report provides further details about 
these communicators, which are created by the XMP implementation. The right 
diagram provides additional information on the function stack for the two conflicting 
MPI Barrier calls. XMP Barrier is the XMP runtime implementation for 
any explicit or implicit barrier. The graph also shows that this XMP barrier is called 
from two different locations—lines 15 and 23—in the source code, although the 
original source code only has 15 lines. 

This example emphasizes, that correctness analysis for XMP applications can be 
done at the MPI level, but is not too useful for the application developer. In other 
previous work [1, 11] we have seen that we can achieve better results concerning 
precision and recall if we base the analysis on the semantics of the high-level parallel 
programming paradigm. Furthermore, the analysis at a higher abstraction level can 
help to provide more meaningful error reports. In the following, we will see how 
this applies to XMP. 


Analysis in Global-View Programming For the errors, where XMP code might 
violate restrictions imposed by the XMP specification, we distinguish between static 
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Fig. 2 Deadlock detected by MUST for Listing 3 when only looking at MPI 


and dynamic properties of the code. Most of the restrictions on the standalone 
declarative directives like nodes, template, or align have an impact on static 
properties of the code. Those restrictions include self-reference in a declaration or 
lexical name conflicts of handle names with other symbols in a scoping unit. Such 
restrictions should be enforced by the compiler and result in meaningful compile- 
time error messages. 

Restrictions on dynamic properties depend on the specific value a variable has at 
runtime. We distinguish between those that involve only the local process and those 
that involve multiple or all processes. For global-view programming the latter is 
only the case for collective operations, which require consistent clauses and values 
among all contributing processes. As for MPI collective communication functions, 
we analyze this as a reduction analysis, where each node in the TBON compares all 
incoming events and passes one representative event to the parent node. 

All other restrictions on dynamic properties can be analyzed locally. Various 
XMP constructs have the same restriction as mentioned in Sect.4.1 for the 
barrier construct. The nodeset used for the construct needs to be a subset of 
the currently executing nodeset. The executing nodeset is the set of nodes executing 
the current XMP region. The loop construct as well as the task construct allow 
to restrict the currently executing nodeset. To perform runtime analysis for such a 
subset requirement, an analysis tool needs to understand the concept of nodeset and 
how they can be derived. Listing 5 provides some examples for slicing nodesets. 
The nodeset p consists of the eight processes executing the application, it is also 
called entire nodeset. The nodeset q skips the first node in p and recruits nodes two 
to five from p. The nodeset r is a two-dimensional nodeset, which can be used from 
two-dimensional domain decomposition. The nodeset s is also two-dimensional, 
but uses only every other node in p. Now, checking whether q [2] is subset of s 
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Listing 5 Deriving nodesets in XMP 


#pragma xmp nodes p(8) // entire nodeset 
#pragma xmp nodes q(4)=p(2:5) // slicing nodes 2-5 
#pragma xmp nodes r(2,4)=p(1:8) // 2-dimensional nodeset 
#pragma xmp nodes s(2,2)=p(1:8:2) // skip every other node 


is a non-trivial question. One method to perform the subset analysis is to expand 
the given nodeset to the entire nodeset while marking participating nodes. This 
expansion is not scalable for large numbers of processes, but for some comparison 
of nodesets we cannot avoid the full expansion. 


Analysis in Local-View Programming For local-view programming, our main 
focus is on data race detection in remote memory access. A data race is commonly 
understood as concurrent access of multiple execution entities to the same data in 
memory while at least one access is writing to memory. Concurrent access implies 
that there is no synchronization between the two memory accesses. We can observe 
two ways of access to a coarray in XMP, both can be found in the code example 
in Listing 2. The remote image access denoted by a[:] : [target] has write 
semantics on the target memory for the put operation and read semantics for the get 
operation. The local image access denoted by a[:] or by a[1] has the memory 
access semantics as suggested by the base language. In general, an application might 
access and modify the local image through a pointer to the local image. Especially, 
when the array is passed to a library, as a linear algebra library, the access to the local 
image is out of control of the XMP compiler or runtime system. To detect data races 
on remote memory access, we need to instrument the local memory accesses as well 
as tracking all remote memory accesses. Since the conflicting memory access might 
occur in the library, also the library needs to be instrumented for the runtime data 
race analysis. This is particularly difficult if the library is only available as a binary. 

For data race analysis in MUST, we build on ThreadSanitizer as logging and 
analysis backend. Memory access instrumentation is performed by clang or GNU 
compiler during compilation. In addition, we provide high-level synchronization, 
memory access, and concurrency semantics into the ThreadSanitizer analysis. 
Therefore we extend the annotation interface used by ThreadSanitizer and Valgrind 
to feed all necessary information into the analysis. An access to the local image 
by a remote process should be seen concurrent to any previous access by a 
different process, that is not synchronized with the current access. Synchronization 
in XMP is possible with global synchronization, e.g., using the sync all directive 
respectively an xmp_sync_all1() call, or point to point synchronization, e.g., 
using the sync image directive respectively an xmp_sync_image() function 
call. 
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5 Tools Interface for XcalableMP 


To provide XMP specific runtime information to analysis tools, we designed and 
implemented the XMP tools interface—XMPT. The interface builds on experiences 
from the OpenMP tools interface. As an example, the specific environmental 
variable X€MP_TOOL LIBRARIES allows loading an XMP specific tool, when the 
XMPT interface is available during application execution. This imitates the OMPT 
specific environmental variable OMP_TOOL LIBRARIES and allows building 
portable tools, which dynamically adapt to the parallel programming paradigm used 
by a program. 

During startup of an XMP application, the XMP runtime tries to find an XMPT 
tool, which is identified by the exported function xmpt_ initialize. Other than 
in OMPT we don’t need a three-way handshake for tool initialization, as the XMP 
runtime doesn’t need to adopt the own initialization in case an XMPT tool is present. 
Once a tool is found, the XMP runtime calls this function and the tool has the chance 
to register callbacks for certain XMP events. The current implementation of XMPT 
provides callbacks for all global-view directives and constructs as well as for coarray 
memory access and synchronization in local-view programming. 

The data mapping identifiers like node-names and template-names are identified 
by their opaque XMP descriptor handles. To recognize such a descriptor and store 
information on the descriptor, the XMPT interface allows binding tool data to each 
XMP descriptor. 

The OpenMP specification restricts the OMPT tool to only use OMPT runtime 
functions, but not to call OpenMP runtime routines like omp_get_num_threads, 
nor to use OpenMP pragmas to implement OMPT callback functions or signal 
handlers. Without this restriction, the OMPT tool might cause a deadlock in the 
execution of an OpenMP application, because the OpenMP runtime could hold a 
lock that it tries to acquire again when the OpenMP runtime function is called. The 
main difference in this particular aspect is that XMP is initialized explicitly at an 
early point in the execution by calling xmp_ init, while OpenMP implementations 
tend to lazy initialize when the first OpenMP construct or runtime routine is called. 
For thread-safe initialization, the OpenMP runtime might acquire an initialization 
lock at any entry to the runtime. 

For XMPT there is no restriction on the use of XMP runtime functions so that 
an XMPT tool can use the variety of inquiry functions to collect all necessary 
information about the opaque XMP descriptor handles. This allows to query 
information on XMP specific entities on demand and avoids to transport all available 
information as arguments to the callbacks. This makes the interface both more 
compact and more efficient. 
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6 YvetteML 


YvetteML (YML) is a workflow description language for technical or scientific 
calculation that describes dependencies among tasks. 

YML interprets low source code and dependencies between tasks to generate the 
indicated DAG and execute the task according to the DAG. YML of the original 
casing is, P2P tasks that are written sequentially in the language it was assumed 
to run in an environment or a small cluster, of tasks written in a parallel language. 
By using YML it became possible to run the application on a large scale system. 
We also developed middleware for porting. The middleware used to implement the 
mSPMD programming model is OmniRPC-MPI [13]. It provides Remote Procedure 
Call (RPC) based on MPI and is an extension of the library OmniRPC. Our 
OmniRPC-MPI middleware is a workflow scheduler to control remote programs 
which are created for task execution by use of MPI_Comm_spawn on request. 
Control and data flow is implemented using MPI functions such as and MPI_ Send 
and MPI_Recv. 

Listing 6 shows a simple example for a YML program. It invokes a function 
add which takes two double arguments and on return provides the sum in the first 
argument. The execution starts sequentially, at first result = 1 + 2 is calculated. 
Then execution continues parallel with three concurrent code blocks, separated by 
/ /. The first code block is just to satisfy the dependency on ping [0], the other two 
concurrent code blocks execute five parallel iterations each. We can interpret each 
of the iterations as a task, the wait and notify statements express dependencies. 
The YML interpreter generates a DAG as depicted in Fig. 3, where each parallel 
block and each parallel loop iteration becomes a task. Each of the leaf tasks executes 


Listing 6 YvetteML example 


compute add(result, 1.0, 2.0); # result <- 1 + 2 
par 
notify (ping[0]) ; 
// 
par (i:=0;4) 
do 
wait (ping[i]); 
compute add(result, result, result); 
notify (pong [i] ) ; 
enddo 
// 
par (i:=0;4) 
do 


wait (pong[i]); 
compute add(result, result, result); 
notify (ping[i+1]); 
enddo 
endpar 
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—— tasktree ——>» notify-wait 


Fig. 3 Graph of tasks as defined by the YvetteML code in Listing 6. The task nodes representing 
the inner task executing the compute add are represented by the result of their computation 


one of the add functions and the vertices represent the dependencies expressed by 
the notify and wait statements. Due to the alternating dependencies between 
the tasks, this program at the end executes sequentially. 


7 Unite and Conquer Approach Using YvetteML 


The Unite and Conquer approach was introduced by Emad et al. [6]. The principle of 
this approach is to make the collaboration of several iterative methods to accelerate 
the convergence of one of them. This approach can be seen as a model for the 
design of numerical methods by combining different computational components 
to work for the same objective, with asynchronous communication among them. 
Unite implies the combination of different computational components, and conquer 
represents different components work together to solve one problem. Different 
independent components with asynchronous communications can be deployed on 
various platforms such as P2P, cloud and supercomputer systems. The idea of 
mixing asynchronously restarted Krylov methods using distributed and parallel 
computing was initially introduced by Guy Edjlali and Serge Petiton [4, 5]. 
They experimented those hybrid Krylov methods asynchronously on networks of 
heterogeneous parallel computers (e.g., using two Connection Machines, a CM5 
and a CM200 and a network of workstations). 
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Dividing iterative methods into components coupled with asynchronous com- 
munication, as suggested in the Unite and Conquer approach, introduces both 
numerical and parallel benefits for the components. 

Numerical benefits: for conventional deflation and polynomial preconditioned 
methods, the information used is obtained from previous Arnoldi reduction, and 
it might be difficult to explore larger subspace. Therefore, the convergence might 
be slowed down. For the methods implemented with the proposed paradigm, the 
solving and preconditioning parts are independent. This information applied to 
the deflation or polynomial preconditioned Solver Components can be different 
from their own Arnoldi reduction, which improves the flexibility of the algorithms, 
e.g., much more eigenvalues and larger searching space for the deflation. Hence 
the limitation of spectral information caused by restarting might be broken down, 
and faster convergence might be obtained. The numerical benefits for linear and 
eigensolver are already respectively discussed in [7, 14]. 

Parallel benefits: parallel performance of iterative methods can be improved by 
the asynchronous promotion and reduction of synchronizations and global commu- 
nications, especially the synchronization points for the preconditioning. Separating 
components improves also the fault tolerance and reusability of algorithms. 


7.1 UCGLE 


UCGLE (Unite and Conquer GMRES/GMRES-LS method) is a linear equation 
solver implementation based on the Unite and Conquer approach. It composes 
mainly three computing components: ERAM, GMRES (Generalized Minimal 
Residual method), and LS (Least-Squares polynomial method). The GMRES 
component is used to solve the systems, the LS and ERAM components work as 
the preconditioning part. The asynchronous communication of this hybrid method 
among three components reduces the number of overall synchronization points and 
minimizes global communication. The work-flow of UCGLE with three computing 
components: The ERAM component computes the desired number of dominant 
eigenvalues, and then sends them to LS component; the LS component uses these 
received eigenvalues to generate a new residual vector, and sends it to the GMRES 
component; the GMRES component uses this residual as a new restarted initial 
vector for solving the non-Hermitian linear systems. Figure 4 shows the better 
convergence acceleration of UCGLE compared with preconditioned GMRES. The 
convergence of UCGLE is accelerated by the LS polynomial preconditioning. 

For the use-case of multiple right-hand sides, Wu and Petiton extend this method 
to m-UCGLE [14]. The m-UCGLE approach furthermore splits the problem into 
blocks, which are solved individually while feeding their results asynchronously 
into the computation of the other blocks. This loosely synchronized blocking 
approach is supported by the general asynchronous feedback loop in the UCGLE 
approach. Overall this method shows better scalability than other approaches while 
still profiting from the improved convergence behavior of the UCGLE method. 
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Fig. 4 Convergence comparison of UCGLE method vs. classic GMRES 


7.2 Extending YvetteML to Support m-UCGLE 


With the current version of YML, the implementation of an m-UCGLE method is 
not possible due to two limitations: There is no mean for asynchronous communica- 
tion in YML as needed for the asynchronous feedback loop. YML also provides no 
way to break early from a YML loop. The latter would be needed to stop iteration 
at a convergence condition. To make the control flow depending on asynchronous 
communication, it is necessary to break at multiple levels. Therefore, we propose 
different kinds of exiting a parallel branch in YML: 


1. the application may exit the parallel branch if all the running tasks are completed, 
e.g., if there are several BGMRES components in parallel to solve linear systems, 
this parallel section should be exit if all the BGMRES component achieve the 
convergence; 

2. the application may exit the parallel branch if only one task among all is 
completed, e.g., in the MERAM algorithm, several ERAM components are 
executed in parallel to approximate the eigenvalues of a matrix, if one of these 
components approximates enough eigenvalues, the whole parallel section should 
be exited; 

3. the application may exit the parallel branch if only several tasks among all are 
completed; 

4. for the application with multi-level parallelism, we may decide to exit several 
levels of parallel branches; and 

5. the application may exit with saving selected data into the local filesystems, 
which will improve its fault tolerance and reusability, e.g., lsparams generated by 
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the B-LSP Component could be saved into local, which will be used for solving 
the linear systems in future. 


By use of the different ways to exit a parallel branch, a unite and conquer algorithm 
could be implemented with YML. The latter point would even introduce resilience 
to the YML implementation allowing efficient checkpoint and restart to be defined 
in the YML description. 


8 FP2C 


FP2C (Framework for Post-Petascale Computing) is a development and execution 
environment which supports multi-program methodologies across multiple archi- 
tectural levels as suggested by Dufaud et al. [3]. FP2C integrates XMP to describe 
tasks into the workflow environment YML. Therefore FP2C is an implementation of 
YML to be executed on classical HPC clusters. FP2C is composed of three layers: 


1. workflow programming, 
2. parallel and distributed programming, and 
3. shared-memory parallel programming/accelerator. 


The tasks are expected to be executed on sub-clusters or groups of nodes which 
are tightly connected. These tasks would be hybrid programs with distributed and 
shared programming models. The workflow scheduler among the sub-clusters or 
groups invokes and manages the tasks. 

The YML backend implementation used for this configuration is OmniRPC-MPI 
to allow dynamic creation and control of MPI processes needed to executed the 
YML tasks on an HPC cluster. OmniRPC-MPI is an extension of OmniRPC [12], 
which supports remote procedure call (RPC) in a grid environment. When the 
OmniRPC-MPI receives requests to invoke remote programs or to execute tasks 
on the remote programs, then it handles the requests by calling MPI function such 
as MPI_Comm_spawmn to create new processes for the task or MPI_ Send to notify 
existing, available processes about the new task. 

Figure 5 depicts the execution of a workflow with FP2C. Initially, mpirun only 
starts the process for the YML scheduler. The scheduler loads the task graph and 
starts executing the YML program by creating and scheduling YML tasks. Using 
MPI _Comm_spawn, the scheduler creates remote programs with the required 
number of processes to execute a specific task. To avoid the overhead of process 
startup and shutdown, the scheduler can reuse an existing group of processes to 
schedule another task, when the previous task is finished like depicted for task2 
and task3. By the use of MPI point-to-point communication, the remote program 
is informed about the next task to execute and also communicates back about 
the completion of a task. If some YML tasks need a different number of parallel 
processes than the previously finished task, FP2C will terminate the remote program 
to spawn new remote programs as depicted for task1, which is replaced by smaller 
remote programs to execute task5 and task4. 


562 J. Protze et al. 


nodeO nodel node2 node3 node4 


mpirun MPI_Comm_spawn 


remote program1 remote program2 


<task2> 
<task1> #pragma xmp loop on t(i) 
#pragma xmp task on p(5) 


MPI Barrier 


(wait) (wait) 


7% Aajnpayos jw 


remote remote 
program3 program4 Ziska 


MPI_Send(...) 
<task5> <task4> MPI_Allreduce(...) 


— invocation - - > communication 


Aeaqi| IdIN -Dduw 


Fig. 5 Execution of an mSPMD application described in the YML control and data flow language 
in the FP2C implementation 


9 Correctness Checking for YML 


Correctness checking touches YML at multiple levels: 


e The language Yvette, that expresses the graph semantics within YML, is quite 
similar to the hardware description language Esterel and suffers from similar 
correctness issues. In this section, we will especially cover potential deadlocks 
and data races as well as some semantic issues that come with this language. 

¢ The runtime system implementation of YML could also be subject of correctness 
analysis. The challenge is then to distinguish the behavior of the YML runtime 
system from application behavior to minimize the analysis overhead. 

e Finally, since YML expresses a workflow and runs various modules, it can be of 
interest to analyze the individual modules separately for correctness. 


9.1 Programming Errors in YML Description 


With the current specification of YML, the graph defined by a YML graph descrip- 
tion can be statically built and therefore also statically analyzed. We identified 
various possible error patterns in graph descriptions. Possible errors include the use 
of undefined variables, type miss-match for a variable, but also deadlock due to wait 
conditions which never receive a signal. Due to the static and self-contained nature 
of the graph description language, even the possible deadlocks can be identified 
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statically with data flow analysis. The analysis for those programming errors should 
be integrated into the YML compiler. 


9.2 Challenges of Analyzing the YML Runtime System 


Analyzing the runtime system of YML has two mayor challenges for a runtime 
analysis tool like MUST, which is developed to support the analysis of common 
HPC applications. The first challenge is to understand the difference between code 
that represents the YML runtime system and code that belongs to the application 
code, in this context the YML task code. The bigger challenge is the dynamic MPI 
characteristic of the YML runtime system, which dynamically creates processes 
using MPI_Comm_spawn, that are then integrated into the execution and should 
also be supervised by the analysis tool. The analysis tool would also need to 
understand the resulting new MPI communicators as well as the communication 
patterns with those spawned processes. 

Supporting an application that exposes such dynamic behavior is currently not 
supported by MUST and the underlying TBON communication layer. The tool 
would dynamically need to decide about additionally needed analysis processes to 
extend the TBON. Creating a TBON infrastructure which supports such dynamic 
application behavior might be subject of a future project. 


9.3 Correctness Checking Integrated into FP2C 


Since each YML task, invoked by the YML runtime, can be a complete parallel 
program, such task can have any issue which can also be found in parallel programs. 
Therefore a developer might want to analyze individual tasks for parallel correctness 
to identify issues like deadlocks or data race within a task. We introduce a new 
option for the definition of compute functions into the YML description, which 
allows applying an analysis tool like MUST to specific YML tasks. 

For those selected tasks, the YML scheduler needs to launch additional processes 
to execute the distributed and centralized analysis of MUST as depicted for remote 
program? in Fig. 6. In this specific example, MUST executes both kinds of analysis 
in a single process. Before launching the FP2C application, the MUST infrastructure 
needs to be prepared for the execution with each task configuration, which would 
be done by the must run execution wrapper for a normal MPI or XMP application. 
For the execution of a YML task with applied MUST analysis, the remote program 
controlled by FP2C then needs to select the appropriate prepared configuration of 
MUST, which is typically done by exporting some environmental variables. 

Since we specifically want to analyze the YML task, but not the YML infras- 
tructure, the MPI functions called to implement the FP2C command and control 
workflow should be ignored by the analysis tool. Some of those MPI functions are 
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Fig. 6 Integration of MUST into FP2C: the YML scheduler launches selected YML tasks under 
the control of MUST runtime correctness checking. MUST analyses are applied only to parallel 
programming constructs used in the task code, but not to MPI communication utilized by the FP2C 
command and control workflow 


used to communicate with the YML scheduler, which is outside of the process 
controlled by MUST. Such communication to the outside would confuse some 
analysis performed by MUST. We can avoid analysis of such functions by directly 
calling into the PMPI interface for functions that implement FP2C functionality. 
Circumventing MUST analysis for all FP2C-owned MPI communication can result 
in a deadlock: As Fig.5 shows, FP2C will execute a barrier at the end of task 
execution to ensure that all processes finished the execution of the task. For 
native FP2C execution, it is valid to use MPI_COMM_ WORLD for this barrier. The 
MUST analysis process does not know about the barrier and the execution will 
therefore stall. With FP2C using PMPI calls, MUST will not be able to replace 
MPI COMM WORLD by a communicator representing the application processes as 
it was described in Sect. 2.2. The application processes cannot pass the barrier and 
the MUST process waits for new messages from the application processes. We 
could fix this issue by deriving fp2c_world from MPI_COMM WORLD using 
MPI_Comm_dup as shown in Fig. 6. The £p2c_world communicator can then 
safely be used by FP2C in PMPI communication calls which are limited to the 
application processes. 

Another challenge when applying MUST to YML tasks is to deal with 
the output files of MUST. By default, MUST assumes that it is applied to a 
single MPI application and will write an output file to the current working 
directory. With FP2C we apply MUST to various YML tasks. To enable the 
application developer to associate the error report to a specific YML task, we 
should write the MUST output to a different file per task. The MPI specification 
defines int MPI Pcontrol(const int level, ...) to allow flexible 
interaction between MPI application and PMPI tool. It is the responsibility 
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of the tool to interpret and define the arguments passed to this variadic 
function. For our use case, we defined usage with level=8192 and signature 
int MPI _Pcontrol(const int level, const char» filename) 
to indicate that the analysis results of subsequent application events should be 
written to the new file name. We also want to make sure that distributed analysis for 
application events before the pcontrol call write the report into the old file. Therefore 
we require this pcontrol function call to be collective on the whole application. 
Currently, we do not require to finish all MPI communication at this point. In the 
future, we might add some additional pcontrol commands to express certain runtime 
assertions. Such assertions might include that no outstanding messages are expected 
or all MPI handles should be released at a certain point. 


10 Conclusion 


In this paper, we discussed how we can apply runtime correctness checking to 
emerging multi-level parallel programming languages which try to encounter the 
challenges of multi-level concurrency of exascale systems. Specifically, we looked 
into possible correctness issues in XMP applications, which represent the field of 
PGAS languages. We described how we integrated runtime correctness analysis for 
XMP applications into the runtime correctness checking tool MUST and therefore 
specified the new tools interface XMPT for XcalableMP. The workflow description 
language YML allows to introduce another level of high-level concurrency and 
therefore better exploit the massive available concurrency of exascale systems. As 
an example application for such a high-level concurrency workflow, we introduced 
the unite and conquer method UCGLE. This method improves the convergence 
behavior of certain solvers of linear equation systems by asynchronously exchang- 
ing intermediate results of preconditioner and solver. We introduced FP2C as an 
implementation of YML targeting HPC systems. We showed how we could integrate 
MUST runtime analysis to be applied to certain tasks scheduled by the FP2C 
runtime system and discussed solutions for challenges on the way to a successful 
workflow. 
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Abstract Simulation of mantle convection on planetary scales is considered a 
grand-challenge application even in the exascale era. The reason being the enormous 
spatial and temporal scales that must be resolved in the computation as well as 
the complexities of realistic models and the large parameter uncertainties that need 
to be handled by advanced numerical methods. This contribution reports on the 
TerraNeo project which delivered novel matrix-free geometric multigrid solvers for 
the Stokes system that forms the core of mantle convection models. In TerraNeo the 
hierarchical hybrid grids paradigm was employed to demonstrate that scalability can 
be achieved when solving the Stokes system with more than ten trillion (1.1 - 1013) 
degrees of freedom even on present-day peta-scale supercomputers. Novel concepts 
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were developed to ensure resilience of algorithms even in case of hard faults and 
new scheduling algorithms proposed for ensemble runs arising in Multilevel Monte 
Carlo algorithms for uncertainty quantification. The prototype framework was used 
to investigate geodynamic questions such as high velocity asthenospheric channels 
and dynamic topography and to perform adjoint inversions. We also describe the 
redesign of our software to support more advanced discretizations, adaptivity, and 
highly asynchronous execution while ensuring sustainability and flexibility for 
future extensions. 


1 Introduction and Motivation 


Geodynamics 

Mantle convection is a critical component of the Earth system. Although composed 
of rocks, the Earth’s mantle behaves like a highly viscous fluid on geological 
time-scales. Its motion is driven by internal heating due to radioactive decay 
and by substantial heating from below. The latter stems from the release of 
primordial heat stored in the core from the time of the planet’s accretion. The 
mantle convection currents are largely responsible for many of Earth’s surface 
tectonic features, forming mountain chains and oceanic trenches through plate 
tectonics, and contributing significantly to the accumulation of stresses released 
in inter-plate earthquakes. Hence, a thorough quantitative understanding of mantle 
convection is indispensible to gain further insight into these processes. However, 
besides such sort of fundamental questions, mantle convection does also have direct 
influence on some societal and commercial issues. Viscous stresses caused by 
up- and downwellings in the mantle induce dynamic topography, i.e. they lead to 
elevation or depression of parts of Earth’s surface. Reconstructing the latter and the 
associated sea-levels back in time is crucial for localisation of oil-reservoirs and the 
determination of future sea-level rises caused by climate change. 

Although the basic equations for describing the mantle convection process are 
not in question, resulting from the force balance between viscous and buoyancy 
forces and conservation of mass and energy, key system parameters, such as the 
buoyancies and viscosities, remain poorly known. In particular the rheology of 
the mantle, which is a fundamental input parameter for geodynamic models, is 
not well known. Studies based on modeling the geoid e.g. [87], the convective 
planform e.g. [21], glacial isostatic adjustment e.g. [72, 80], true polar wander e.g. 
[82, 86, 89] and plate motion changes e.g. [57] consistently point to the need for 
a significant viscosity increase between the upper and the lower mantle. But the 
precise form of the viscosity profile remains uncertain. Commonly the viscosity 
profiles display a peak in the mid lower mantle [81, 85], or involve a rheology 
contrast located around 1000 km depth [93], or favor an asthenosphere with a strong 
viscosity reduction to achieve high flow velocities and stress amplification in the 
sublithospheric mantle [51, 102]. Geodynamic arguments on the uncertainties and 
trade-offs in the viscosity profile of the upper mantle have been summarized recently 
by Richards and Lenardic [88]. 
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Fig. 1 Velocity flowlines under the Himalaya mountain range; snapshot from a global convection 
model simulated with hierarchical hybrid grids 


Since the mantle is not directly accessible to observations and laboratory 
experiments can hardly reproduce the relevant temperatures, pressures, and time- 
scales, computer simulation is vital for studying mantle convection. An example 
is given in Fig. 1. One of the challenges is the enormous spatial scales that must 
be resolved. While the mantle has a thickness of roughly 3000km, important 
features, such as subducting slabs might only have a width of several kilometers. 
Additionally, realistic simulations will require a very large number of time-steps. 
Furthermore, fully realistic models of the mantle must deal with many complexities 
such as non-linear viscosity models, phase transitions, or the treatment of the non- 
constant hydrostatic background density of the mantle, see e.g. [50]. 


TERRANEO 

TERRANEO! aims to design and realize a new community software framework for 
extreme scale Earth Mantle simulations. We have a special constellation where 
groups from Geophysics, Numerical Mathematics, and High Performance Comput- 
ing collaborate towards a unique co-design effort. The first 3-year funding period 
has already been summarized in [7]. Initially, the team had focussed on fundamental 
mathematical questions together with general scalability and performance issues, as 
documented in [38, 40-43]. In particular it could be shown that even with peta-scale 
class machines, computations are possible that resolve the global Earth Mantle with 
about | km resolution, requiring the solution of indefinite sparse linear systems with 
more than 10!? degrees of freedom. These very large computations and scaling 
experiments were performed with a prototype software that was implemented 
using the pre-existing hierarchical hybrid grids (HHG) multigrid library [12, 13]. 
This step was necessary to gain experience with parallel multigrid solvers for 


"http://terraneo.fau.de. 
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indefinite problems and the specific difficulties arising in Earth Mantle models. 
While this approach led to quick first results, the prototype character of HHG 
implied several fundamental limitations. Therefore, the second funding period is 
being used for a fundamental redesign of HHG under the new acronym HYTEG for 
Hybrid Tetrahedral Grids. The goal is to leverage the lessons learned with HHG 
for the design of a new, extremely fast, but also flexible, extensible, and sustainable 
Geophysics software framework. 

Additionally, research on several exascale topics was conducted in both funding 
periods. These include resilience, inverse problems, and uncertainty quantification. 
Detailed results on these topics, as well as on the new HYTEG software architecture, 
will be reported in the following sections of this report. 


Multigrid Methods 

Multigrid methods belong to the set of fastest solvers for sparse linear systems 
that arise from the discretization of PDEs and are, thus, of central importance for 
exascale research. Their invention and their popularization e.g. in [17] constitute one 
of the major breakthroughs in numerical mathematics and computational science. 
To illustrate their relevance for exascale computing, we summarize here a thought 
experiment from [90]. 

Assuming that systems with N = 10!, ie. a trillion degrees of freedom 
(DoF) would be solved by classical Gaussian elimination, this would require the 
astronomical number of operations of 2/3N* = 2/3 x 10°°. Better elimination 
based algorithms can exploit the sparse matrix structure. A typical method of this 
class, such as nested dissection would still require around 10*4 operations. For such 
a work load, a modern, fast PC with a performance of 100 GFlops would need 
more than 100,000 years of compute time. Note that solving one such system is 
still only worth one time step of many. This line of thought may be seen as just 
another argument why Earth Mantle Convection is a grand challenge problem and 
why e.g. the authors of [23] write: 


A uniform discretization of the mantle at for instance | km resolution would result in meshes 
with nearly a trillion elements, which is far beyond the capacity of the largest available 
supercomputers. 


It is important to realize that parallel computing alone does not solve the problem. 
Even the fastest German supercomputer, currently SuperMUC-NG*, would still 
need longer than a year of compute time (if it did not run out of memory before) to 
execute 10** operations. 

In the TERRANEO project we have demonstrated, and we will report below, that a 
well designed massively parallel multigrid method is indeed capable of solving such 
large systems with a trillion unknowns in compute times in the order of seconds. 
This is fundamentally based on their fast convergence, for many types of problems, 
that leads to asymptotically optimal complexity solvers, when it is combined with a 
nested iteration in the form of a so called full multigrid method [17, 41]. Based on 
this, it comes as little surprise that multigrid methods are employed in several of the 


*https://doku.Irz.de/display/PUBLIC/SuperMUC-NG. 
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SPPEXA projects, such as e.g. [3, 4, 24, 69, 75] with excellent success for a wide 
variety of applications. 

In the literature on multigrid methods of the past two decades, much attention 
has been given to so called algebraic multigrid methods (AMG) that do not operate 
on a mesh structure, but that attempt to construct the necessary hierarchy based on 
analyzing the sparse system matrix. These methods can exhibit excellent parallel 
scalability [3] and are often used as components within other parallel solvers, such 
as domain decomposition methods [62]. AMG methods have several advantages, 
most importantly that they can be interfaced to an application via classical sparse 
matrix technology. They can also save the application developer from worrying 
about the necessary solution process and many numerical analysts who design novel 
discretizations have taken this perspective. However, AMG methods work only well 
for certain types of systems, and consequently users of these new discretization 
techniques now find themselves being trapped with linear systems for which no 
efficient parallel algorithms are available, yet. Few methods devised in the applied 
mathematics community can demonstrate even solving systems with 10° degrees of 
freedom, falling three or four orders behind of what we can demonstrate here. 

Additionally, the convenience of algebraic multigrid also comes at the price of 
a loss in performance. Algebraic multigrid methods have only a limited scope of 
applicability, and additionally they often lose an order of magnitude in performance 
in their expensive setup phase. Of course they are inherently also not matrix-free. 
As we will discuss below, matrix-free methods are essential to reach maximal per- 
formance. Geometric multigrid methods can be realized as matrix-free algorithms, 
potentially leading to another order of magnitude in performance improvement. For 
this reason we have invested heavily in new matrix-free methods [8, 9, 11], similar 
to other SPPEXA projects [5, 67]. 

The price of using geometric multigrid lies in the more complex algorithms 
which often have to be tailored carefully to each specific application, and the 
significantly more complex interfaces that are needed between the other software 
components and the solver. This in turn creates the need for new software technolo- 
gies as are being developed in the HYTEG framework. 


2 Basic Ideas and Concepts 


As mentioned above the mathematical-physical model of mantle convection is 
based on the force balance between viscous and buoyancy forces and conservation 
of mass and energy. The resulting general system of equations is e.g. given in 
[7, 105]. Expressing these in terms of the central quantities of interest, i.e. velocity, 
pressure and temperature results in a coupled system composed of a stationary 
Stokes component and a time-dependent equation for the temperature. The former 
constitutes the most expensive part in these kinds of simulations and we will, thus, 
concentrate on this aspect mostly in the following. In the case of the Boussinesq 
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Table 1 Physical quantities and their representing symbols 


Ro | Earth’s radius po | Reference density g | Gravitational 
acceleration 
n Dynamic viscosity a | Thermal expansivity k | Heat conductivity 


AT | Temperature difference between core-mantle-boundary and surface 


approximation the Stokes part is given in non-dimensional form by 


—div(n(Vu + (Vu)")) + Vp = —RaT e, (1) 
div(u) = 0 (2) 


with dimensionless velocity u, pressure p and the unit vector in radial direction e,. 
The Rayleigh number Ra describes the vigor of the convection and is given by 


agpoAT Re 


KN 


Ra= 


A description of the physical quantities represented by the symbols is given 
in Table 1. The finite element discretization that we consider for the stationary 
equations is backed by the concept of HHG. Initially, the domain is partitioned into 
an unstructured mesh of tetrahedra. In a second step, each tetrahedron is iteratively, 
uniformly refined in a way that a block-structured, tetrahedral mesh is created 
[12, 13, 40]. This refinement strategy results in a grid hierarchy T := {T¢, £ = 
0,1,..., L} that allows for a canonical implementation of geometric multigrid 
methods. 

The computational domain is distributed to the parallel processes by means of the 
unstructured coarse grid elements. To realize efficient communication procedures, 
the mesh is enhanced by so-called interface primitives. For each shared face, edge, 
and vertex an additional macro-primitive is allocated in the data structure. Together 
with the macro-tetrahedra, each primitive is assigned to one parallel process. 
Starting from a two times refined mesh, each macro-primitive has at least one inner 
vertex. 

We employ conforming finite elements on the domain Q and define the function 
spaces of globally continuous, piecewise polynomial functions with polynomial 
degree d on level £ as 


SË (Q) = {v €C(Q): vir € Pa(T), VT € To}. 
For the majority of our experiments and simulations we implement a stabilized 


P1-P1 discretization for the Stokes equation, using the velocity and pressure finite 
element spaces 


Vex Qe = [si (8N H| x SHQ) N L2(Q). 
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Here, L$(Q) := {q € L?(Q) : (q, ljg = 0} with (-, -)g being the inner product 
in L?(Q). The P1-P1 element pairing does not fulfill the LBB-condition and we 
employ a PSPG stabilization [19, 33]. It follows the standard weak formulation for 
the discretized Stokes problem: find (ug, pe) € Ve x Qe such that 


a(ug, Ve) + b(ve, Pe) = Ft (ve) Vwe € Ve, 
b(ue, ge) — cede, Pe) = ge (qe) Vae € Qe, 
where 
a(u, v) := 2(nD(u), D(v))a 


b(u, q) := — (div u, g)a 
fO) := (£, vio, 


for all u, v € Ag (2)? and qE LQ), The stabilization terms are defined as 


1 
ce(qe, po) = $ 5f dx)" (V pe, Vae)r 
TET; T 


1 
ga) =- DD zf. dx)'? (f, Vqe)r. 


TET? 


The discrete formulation of (3) can then be expressed as the system of linear 


equations 
T 
aeea i 
(; B —C/ \p g K 


A description of the general form of the temperature equation and our work on 
its discretization and time-stepping can be found in [7]. 


3 Summary of Project Results 


3.1 Efficiency of Solvers and Software 


A rigorous quantitative performance analysis lies at the heart of systematic research 
in large scale scientific computing. The systematic analysis of performance is central 
to the research agenda of computational science [91], since it defines how successful 
computational models are. In this sense, performance analysis here means more than 
measuring the speedup of a parallel solver or studying its weak and strong scaling 


576 S. Bauer et al. 


properties. Beyond this, modern scientific computing must attempt to quantify 
numerical cost in metrics that are independent of the run-time of an arbitrary 
reference implementation. Most importantly, the numerical cost must be set in 
relation to what is numerically achieved, i.e. the accuracy that is delivered by a 
specific computation. 

As one step in this direction, we extend Achi Brandt’s notion of textbook 
multigrid efficiency (TME) to massively parallel algorithms in [41]. Using the finite 
element based prototype multigrid implementation of the HHG library, we have 
employed the TME paradigm for scalar linear equations with constant and varying 
coefficients as well as to linear systems with saddle-point structure. A core step 
is then to extend the idea of TME to the parallel setting in a way that is adapted 
to the parallel architecture under consideration. To this end, we develop a new 
characterization of a work unit (WU) in an architecture-aware fashion by taking 
into account modern performance modeling techniques, in particular the standard 
Roofline model and the more advanced ECM model. The newly introduced parallel 
TME measure is studied in [41] with large-scale computations and for solving 
problems with up to 200 billion unknowns. 


3.2 Reducing Complexity in Models and Algorithms 


When striving for optimal efficiency, it is essential to choose models and discret- 
ization schemes whose incurred computational cost is as low as possible. Within 
TERRANEO we have therefore analyzed and improved the discretization schemes 
under study. A restricting factor was initially that only simple conforming dis- 
cretizations are feasible in the prototype HHG library. E.g. discontinuous Galerkin 
discretizations, as in [6, 69], were not yet feasible in HHG. In particular, buoyancy- 
driven flow models thus demand a careful treatment of the mass-balance equation 
to avoid spurious source and sink terms in the non-linear coupling between flow 
and transport. In the context of finite-elements, it is therefore commonly proposed 
to employ sufficiently rich pressure spaces, containing piecewise constant shape 
functions to obtain local or even strong mass-conservation. In three-dimensional 
computations, this usually requires nonconforming approaches, special meshes or 
higher order velocities, which make these schemes prohibitively expensive for 
some applications and complicate the implementation into legacy code. In [39], we 
propose and analyze a lean and conservatively coupled scheme based on standard 
stabilized linear equal-order finite elements for the Stokes part and vertex-centered 
finite volumes for the energy equation. We show that in a weak mass-balance it 
is possible to recover exact conservation properties by a local flux-correction which 
can be computed efficiently on the control volume boundaries of the transport mesh. 
Furthermore, we discuss implementation aspects and demonstrate the effectiveness 
of the flux-correction by different two- and three-dimensional examples which are 
motivated by geophysical applications in [101]. 
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In the special case of constant viscosity the Stokes system can be cast into 
different formulations by exploiting the incompressibility constraint. For instance 
the strain in the weak formulation can be replaced by the gradient to decouple the 
velocity components in the different coordinate directions. Thus the discretization 
of the simplified problem leads to fewer nonzero entries in the stiffness matrix. 
This is of particular interest in large scale simulations where a reduced memory 
footprint and accordingly reduced bandwidth requirement can help to significantly 
accelerate the computations. In the case of a piecewise constant viscosity, as it 
typically arises in multi-phase flows, or when the boundary conditions involve 
traction, the situation is more complex, and the cross derivatives in the original 
Stokes system must be treated with care. A naive application of the standard 
vectorial Laplacian results in a physically incorrect solution, while formulations 
based on the strain increase the computational effort everywhere, even when the 
inconsistencies arise only from an incorrect treatment in a small fraction of the 
computational domain. In [55], we present a new approach that is consistent with the 
strain-based formulation and preserves the decoupling advantages of the gradient- 
based formulation in iso-viscous subdomains. The modification is equivalent to 
locally changing the discretization stencils, hence the more expensive discretization 
is restricted to a lower dimensional interface, making the additional computational 
cost asymptotically negligible. We demonstrate the consistency and convergence 
properties of the new method and show that in a massively parallel setup, the 
multigrid solution of the resulting discrete systems is faster than for the classical 
strain-based formulation. 


3.3. Stokes Solvers and Performance 
3.3.1 Multigrid Approaches for the Stokes System 


In this subsection we briefly review different classes of solvers for Stokes type 
systems involving a multigrid component. One characteristic feature is the indefinite 
structure and thus special care in the design of the solver is required. Numerical 
experiments and performance studies were originally performed with the HHG 
software and are thus mostly limited to stabilized conforming P1-P1 discretizations. 
We point here also to the possibility to generate such parallel solvers automatically 
and achieve similarly good scalability and efficiency results, as demonstrated in the 
ExaStencils project [70, 75]. A comprehensive summary of the supercomputers used 
for the simulations conducted in this project is listed in Table 2. 

In TERRANEO three classes of solvers we consider are a preconditioned Krylov 
method for the indefinite system, a preconditioned Krylov method for the positive 
definite pressure based Schur complement and a monolithic (sometimes also called 
an all-at-once) multigrid solver for the indefinite system. While in the first two 
approaches the multigrid solver is only applied to the velocity component, the 
monolithic multigrid scheme works simultaneously on the velocity and the pressure 
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Table 2 Characteristics of the different supercomputers used for simulations presented in this 
publication 


SuperMUC Phase 1 
(Thin nodes) SuperMUC Phase 2 | Juqueen Hazel Hen 


Operation 2012-2018 2015-present 2012-2018 | 2015-present 


# nodes 


Intel Sandy Bridge | Intel Haswell 
ES5-2580 E5-2697v3 
8 Core 14 Core 


CPU frequency 
(GHz) 


# total cores 147,456 80,016 458,752 185,088 
Interconnect Infiniband FDR10 | Infiniband FDR14 | 5D Torus i 


Total memory 288 194 448 
(TByte) 
Linpack (PFlop/s) 


Nodes 
s | 80 [2.7 x10" fio | 685.88 
40 703.69 
320 741.86 
3560 | 40,960 17x102 | 9 | 720.4 

30480 [327680 [11x10 |9 [77609 


component. A systematic quantitative performance study of all three approaches on 
modern peta-scale systems is given in [43]. More than 750,000 parallel threads are 
used in the largest simulation runs. Our main finding is that the all-at-once multigrid 
scheme outperforms the two alternatives. It does not only show the smallest memory 
footprint, but also results in the shortest compute time. More than 10 trillion? 
unknowns (> 10!) have been solved for on a Blue Gene/Q system, see Table 3. 

For such huge systems a sophisticated matrix-free code design with minimized 
memory traffic is a must. Although the use of GMRES type Krylov methods for 
the indefinite system is quite popular due to its robustness, it is not an option for 
us. The non-optimal memory requirement and compute time per iteration before 
the next restart slows down the overall performance drastically and restricts the 
feasible system size. To test the flexibility of the solver, we consider not only a box 
as geometry but also a thick spherical shell motivated by our application, a pipe filled 
with spherical obstacles of different diameters and a simplified artery, see Fig. 2. 
The physical system under consideration then ranges from a thermo-mechanically 
coupled system to a non-linear viscosity model of Carreau type. 


3W.r.t. the short scale naming system. 
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Fig. 2 Left to right: concentration in a lid driven cavity type setting; streamlines and speed in a 
benchmark problem; streamlines for a non-linear dynamic viscosity model and stationary flow in 


a pipe 


3.3.2 Smoothers for Indefinite Systems 


There are different strategies to construct smoothers for the coupled systems. 
Among the popular ones are Braess—Sarazin, Vanka, and Uzawa type approaches. 
While the first two types consider a saddle point problem either in a global or a 
local form as starting point, the Uzawa type scheme is based on a factorization of the 
saddle point system in lower and upper triangular matrices, see [16, 77, 98, 99, 104]. 
The advantage of the Uzawa version is clearly a smaller FLOP count and a reduced 
communication traffic compared to the two mentioned alternatives. Also in contrast 
to the Vanka smoother where local patches have to be considered it can be applied 
node-wise. Let us assume that A is symmetric and positive definite and A and § 
satisfy 


A>A, $>S:=CBA'B" 
then the following smoothing property holds 
IKM” < V210 — D IDI, 


where v is the number of smoothing steps, || - || a suitable operator norm and 


aoe pei 28 _ (Ao it p 
M := Id oe = (48). n= 5 (6 yay): 


A mathematically rigorous proof is given in [29]. Following the abstract framework 
provided by Larin and Reusken [73], the following identity turns out to be of crucial 
importance 


KM = Mi KM?! M3 
with suitable operators Mı, M2 and Ms being the error propagation operator 


of the standard Uzawa. We note that in comparison to earlier theoretical results 
[95, 106, 107], we do not require symmetry of the Uzawa type smoother and thus 
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Table 4 Number of 4 6 8 


v 
iterati to red th 
aie ae DoFs Iter | Time [s] | Iter | Time [s] | Iter | Time [s] 


(Intel Xeon) 14x10? |13 | 0.10 [9 0.07 |7 0.06 
14x10 |12 | 021 |9 0.18 |7 0.15 
12x10% |12 | o61 |8 0.51 [6 | 0.44 
10x106 |11 | 2.44 |8 2.33 |6 2.16 
g.2x 10° |11 | 14.54 |8 | 1458 l6 | 14.03 
66x10’ |11 |10266 |7 | 92.09 |6 | 101.90 
5.3.x 108 |10 |700.38 |7 |693.75 |6 | 769.34 


the computational cost is reduced from two applications of the velocity smoother 
to one in each iteration. The achieved upper bound for the smoothing property 
is sufficient to guarantee level independent W-cycle or variable V-cycle results. 
However, in large scale applications V-cycle schemes are much more attractive than 
W-cycle based multigrid methods. The number of coarse solves per iteration is one 
in case of the V-cycle but growths exponentially with the depth of the multigrid 
hierarchy in case of a W-cycle. Unfortunately, there is no V-cycle theory for this 
all-at-once multigrid method available, and moreover numerical experiments show 
that the convergence rate deteriorates for a fixed and level independent number of 
smoothing steps per level and an increased level count. As a compromise a mildly 
variable V-cycle turns out to perform best. The HHG data structure and the matrix- 
free concept allows us, even on a standard workstation, to test the solver for systems 
with more than 100 millions of unknowns. The computations reported in Table 4 are 
performed on an Intel Xeon CPU E3-1226 v3, 3.30 GHz with 32 GB memory. For 
this first test only a single core is used and thus the timing does not reflect the parallel 
performance of the solver. For the pressure we use damped Gauss-Seidel applied to 
the operator C where the damping parameter is determined by a power iteration on 
a coarse level and not dependent on the mesh-size, but on the mesh-regularity. 

While the all-at-once multigrid solver requires a special smoother, the Schur- 
complement formulation allows for a natural application of multigrid on the velocity 
component as part of an inner iteration in a preconditioned conjugate gradient 
method for the pressure. In case of an isoviscous flow the Schur complement 
is spectrally equivalent to the mass matrix and thus the condition number does 
not deteriorate with an increasing number of degrees of freedom. Due to its 
simple structure this iterative solver can be easily implemented by reusing standard 
software components and is thus suitable for a rigorous performance analysis. We 
do not report here in further detail, but an innovative performance analysis can be 
found in [40, 41]. Good scalability on more than half a million parallel threads is 
demonstrated in [42]. Together with the excellent node-level performance, this is 
essential for achieving high performance levels. 


TERRANEO: Beyond a Trillion DoFs 581 


3.3.3 Multigrid Coarse Grid Solvers 


Finally we report on results for agglomeration techniques. In the context of the 
matrix-free HHG implementation it is natural to use a preconditioned Krylov 
method as a coarse grid solver. However, in the context of very large systems this is 
a bottleneck since this coarse solver is non-optimal, and will thus ultimately become 
a limit to scalability. To improve the parallel efficiency for systems beyond a billion 
of unknowns, we proceed in two steps. In a first preprocessing step, the coarse 
grid problem is assembled in a classical fashion storing matrix entries in standard 
compressed row storage format. The system can then be solved by methods from the 
PETSc library, [79]. More specifically, a block preconditioned MINRES iteration is 
applied. For the velocity component we here use an AMG preconditioned conjugate 
gradient (CG) iteration and for the pressure a lumped mass-matrix approach. Here 
still the full parallel system is employed so that only small coarse grid subproblems 
are assigned to each parallel process. In the second step, we therefore propose to use 
agglomeration to reduce the number of parallel processes on the coarsest level and 
to reduce the communication overhead in each iteration. This becomes necessary 
when the ratio between communication time and compute time becomes unbalanced 
and when the overall coarse solve time is severely dominated by communication. In 
Table 5, we report on the parallel efficiency for a weak scaling study on JUQUEEN 
(IBM BlueGene/Q-System, 28 racks, 28,672 nodes, 458,752 cores). The number of 
unknowns ranges from approximately 80 million to over 100,000 million and the 
number of parallel threads is increased by a factor of more than 15,000. Over this 
large range, the parallel efficiency shows only a moderate deterioration and remains 
above 90% even in the largest run. This results mainly from the fact that for the 
largest run we reduce the number of processes on the coarsest level by a factor of 
eight. Using a master-slave agglomeration technique has the advantage of a short 
collecting and distribution phase. 

In Fig. 3 we compare the naive non-optimal coarse solver with the PETSc based 
agglomeration strategy for the largest run, i.e. np = 61,440. We observe that with 
the naive approach roughly half of the compute time is spent in the coarse solve, 
although it has only a small fraction of the total number of unknowns. The situation 
is drastically different for the agglomeration technique in combination with the 
PETSc solver. Here only roughly 5% of the actual time to solve is spent in the coarse 
solver routines. A closer look at the ratio between MPI communication and operator 
computations exhibits that without agglomeration strategy a significant amount of 


Table 5 Parallel efficiency np DoF Red. | T[s] Coarse | Par. eff 
for master-slave 7 
agglomeration technique and 30 [83 x10 L 16.284 _| 0.043 _ | 1.00 
up to more than 50,000 120 |3.3x 108 |1 16.426 |0.050 | 0.99 
parallel threads on 960 |2.6x 10° |1 17.084 |0.171 | 0.95 
JUQUEEN 7680 |2.4x 10! | 1 17.310 | 0.382 | 0.94 

8 


61,440 | 1.7 x 10!! 17.704 |0.877 | 0.92 
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O 


E Coarse Solve (47.4%) 
E Smoothing (46.1%) 
E Residual (4.2%) 

W Grid Transfer (2.2%) 


E Computation (53%) 
H MPI commun. (47%) 


D 


E Smoothing (83.6%) 
E Residual (7.5%) 

E Coarse Solve (5.2%) 
EŒ Grid Transfer (3.7%) 


E MPI commun. (3%) 
E Computation (96%) 


Fig. 3 Ratio of relative time to solve spent in the different multigrid components: a non-optimal 
Krylov based solver without agglomeration (left) and AMG preconditioned Krylov based solver 
with agglomeration 


time is lost by simple communication. The situation is again drastically different 
for the agglomeration strategy. Here the communication overhead is reduced to less 
than 5%. We note that for a smaller system size and a moderate number of parallel 
threads, the difference in the time to solution for both approaches is negligible. For 
np = 30 the naive non-optimal coarse solver takes less than 10% of the total run- 
time and the communication load is less than 5% compared to the FLOP count. 
Moreover the savings in run-time in case of a PETSc coarse solver are limited due 
to the set-up phase. In conclusion, and somewhat contrary to conventional wisdom, 
we find that for up to ten million unknowns, a more sophisticated coarse solver does 
not pay off. 

To show its generality, we also test the strategy on a different peta-scale system 
(Hazel Hen) and for a problem with a strongly varying discontinuous viscosity 
profile as it is relevant for geodynamical simulations, see Table 6. Here, we 
additionally exploit a block low-rank coarse grid solver (BLR) which uses low- 
rank compressions in an elimination method to devise an approximate solver [1]. 
The reduced parallel efficiency, compared to the setting before, does not result from 
the different architecture, nor the use of BLR, but mainly from the fact that one 
additional multigrid iteration is required. 


Table 6 Parallel efficiency in a weak scaling experiment for a geodynamically relevant setting 


on Hazel Hen 


Time [s] 

E BLRe_ | Coarse Ana. & fac. 
2.7 
25.0 
i116 
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3.4 Multi-Level Monte Carlo 


Due to errors in measurements, geophysical data are inevitably stochastic. There- 
fore, it is desirable to quantify these uncertainties when computing geophysical 
applications. A typical approach is to use ensemble simulations, i.e. running 
multiple computations with slight variations of perturbed data, and evaluating 
computational quantities of interest via Monte Carlo sampling. A naive sampling- 
based uncertainty quantification for 3D partial differential equations results in an 
extremely large computational complexity. More sophisticated approaches, such 
as multilevel Monte Carlo (MLMC), can reduce this complexity significantly. The 
performance can be further enhanced when the Monte Carlo sampling over several 
levels of mesh refinement is combined with a fast multigrid solver. In a parallel 
environment, however, sophisticated scheduling strategies are needed to exploit 
MLMC based on multigrid solvers. In [28], we explored the concurrent execution 
across the three layers of the MLMC method. Namely, parallelization across levels, 
across samples, and across the spatial grid. 

An alternative way to classify these different parallel execution models is to 
consider the time-processor diagram, as illustrated in Fig.4, where the scheduling 
of each sample Yİ, with index 1 < i < Ne on level 0 < £ < L, is represented 
by a rectangular box with the height expressing the number of processors used. We 
call a parallel execution model homogeneous bulk synchronous if at any time in the 
processor diagram all tasks execute on the same level £ with the same number of 
processors. Otherwise we call it heterogeneous bulk synchronous. 

The one-layer homogeneous strategy, as shown in Fig. 4 (left), offers no flex- 
ibility. The theoretical run-time is simply given by the sum of the time of all 
samples. Even if this method guarantees perfect load balancing, it will not lead 
to an optimal efficiency since on the coarser levels the scalability of the solver is 
typically significantly worse than on the finer levels. On the coarsest level we may 
even have less grid points than processors. Thus, we discard this approach from our 
list of possible options. 

We, instead, consider the following two variants: Firstly, sample synchronous 
homogeneous (SaSyHom) where a synchronization step is imposed after each 
sequential step (see Fig.5, left). Here statistical quantities can be updated after 


synchronization point synchronization point synchronization point 
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Fig. 4 From left to right: illustration of homogeneous (one-layer and two-layer) and of heteroge- 
neous bulk synchronous strategies (two-layer and three-layer) 
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Fig. 5 Illustration of different homogeneous scheduling strategies. Left: sample synchronous 
homogeneous (SaSyHom); centre: level synchronous homogeneous (LeSyHom); right: dynamic 
level synchronous homogeneous (DyLeSyHom) 


each synchronization point. Secondly, level synchronous homogeneous (LeSyHom), 
where each block of processors executes all samples without any synchronization 
(see Fig. 5, center). 

In case of large run-time variations of the samples, we additionally consider 
dynamic variants where samples can be assigned to processors dynamically. 
Figure 5 (right) illustrates the DyLeSyHom strategy. Here it is essential that not 
all processor blocks execute the same number of sequential steps. 

We compare static strategies with their dynamic variants in the case of large 
run-time variations on three MLMC levels, i.e. L = 2, with a fine grid resolution 
of about 1.6 - 10’ mesh nodes. We assume a lognormal random coefficient, but in 
order to increase the run-time variation, we choose a greater variance o? = 40 
and à = 0.5. As quantity of interest we employ the PDE solution u evaluated at 
the point x = (0.5, 0.5, 0.5). All samples are computed with a FMG-2V(4,4) cycle 
with up to a hundred additional V(4,4) cycles until a relative residuum of 10~!9, 
Pre-computed variance estimates lead to an a priori strategy with (N¢)e=01,2 = 
(27,151, 10,765, 3792) samples per level. In this scenario, 8192 processors were 
available for the scheduler. In Table 7 we compare the required time of the LeSyHom 
strategy in the static and dynamic variant and see that the dynamic strategy is 6% 
faster than its static counterpart. 

The largest uncertainty quantification scenario we considered, involved finest 
grids with almost 70 billion degrees of freedom, and a total number of samples 
beyond 10,000, most of which are computed on coarser levels. This computation 
was executed on the JUQUEEN supercomputer using more than 132,000 processor 
cores in an excellent overall compute time of about 1.5h. 


Table 7 Comp Arison òt LeSyHom | DyLeSyHom 
static and dynamic ti ti Rati 
scheduling strategies ime [s] ime [s] atio 
500 | 460 0.92 
1512 1347 0.89 


5885 5596 0.95 
7897 7403 0.94 
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3.5 Inverse Problem and Adjoint Computations 


The adjoint method is a powerful technique that allows the computation of 
sensitivities (Fréchet derivatives) with respect to model parameters. It solves inverse 
problems where analytical solutions are not available or the cost to solve the 
associated forward problem many times is prohibitively high. In geodynamics it 
has been applied to the inverse problem of mantle convection—i.e., to restore past 
mantle flow e.g. [22], where it finds an optimal initial flow state that evolves into the 
present-day state of Earth’s mantle. In doing so, the adjoint method has the potential 
to link together diverse observations and theoretical expectations from seismology, 
geology, mineral physics, paleomagnetism and fluid dynamics, greatly enhancing 
our understanding of the solid Earth system. 

Adjoint equations for mantle flow restoration have been derived for incom- 
pressible [22, 52, 59], compressible [35] and thermo-chemical mantle flow [36], 
and the uniqueness properties of the inverse problem have been related explicitly 
to the tangential component of the surface velocity field of a mantle convection 
model [25]. So knowledge of the latter is essential to assure convergence [100] 
and to obtain a small null space for the restored flow history [52]. To reduce the 
computational cost, we use a two-scale time step size predictor-corrector strategy to 
couple between Stokes and temperature similar to the one discussed in [46]. 

The framework was used to implement the adjoint technique. Specifically, we 
minimized the misfit functional 


1 2 
x(T) = 5J I (TT.x, j= T(x) 3(t — ty) dt dx (5) 


Q [tot] 


that relates to the squared difference of the geodynamic model temperature and the 
thermal heterogeneity distribution Tg for the final state at t = t1. The latter can be 
inferred e.g., from seismology and considerations of mantle mineralogy. We used a 
first order optimization approach to update the initial temperature 7p in the descent 
direction ¢ in a process described by 


Tit1(x) = Ti (x) — fox, to) fori=O0,1,..., 


where £ controls the correction in the descent direction and has to be sufficiently 
small [84]. In order to reduce the computational cost of deriving the step length, 
we found that fixing 6 = 0.4 yields sufficient results in our experiments, in 
agreement with [22]. The descent direction $(x, to) is obtained by solving the 
adjoint equations, which are given by Bunge et al. [22]. Solving the adjoint 
equations requires velocity and temperature states from the forward problem. We 
checkpointed these states in the forward problem such that they could be read for 
the adjoint iteration. In our exploration of the inverse problem, we used a similar 
setup as presented in [23]. We considered a 45° x 45° part of a spherical shell 
domain and discretized it by a tetrahedral mesh. Applying four uniform refinement 
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steps, the mesh has a resolution of 2.7 - 10° grid points. The initial temperature field 
given by 


1 
T(x) = To + exp (-5ix — xol?) l 


where o = 1/20 determines the extent of the anomaly and xo denotes the center. 
Note that in our setup the latter is situated 2/9 D below the core-mantle boundary, 
where D denotes relative mantle thickness, and thus lower than in [23]. 


3.5.1 Twin Experiment 


To verify our implementation, we considered a so called twin experiment. In this 
setup a reference flow history is generated from the forward model. The final state 
of this reference flow is used as the terminal state that drives the geodynamic 
inverse problem. The reconstructed flow history is then compared against the known 
reference history to assess the quality of the inversion. Thus, while in a realistic 
geophysical setting only the final state Tg is known, in the twin experiments, the 
true (unreconstructed) initial Ty and final state Tg are given from the reference twin. 
This allows us to study the convergence of initial and final states. In addition to the 
misfit error (5) at the final time, we define the misfit error at the initial time by 


1 2 
mı(T) := >| | (Ta, t)— T;(x)) 5(t — to) dt dx 


Q [to,t1] 


In Fig. 6, we present the initial/final temperature (left/right) from the reference 
twin. Figure 7 shows isosurfaces of the reconstructed initial (top row) and final 
(bottom row) temperature after adjoint iterations i = 0,5, 10. For the sake of 
simplicity, the first guess for the unknown initial temperature is taken from the final 
state of the reference twin. Therefore, the reconstructed final temperature for i = 0 
is a plume that has evolved much farther than the final reference state. After solving 
the inverse problem with 10 iterations, the error at the final state is reduced by more 
than one order of magnitude, while the error at the initial state is reduced by a 


aly atl 


Fig. 6 Left: initial temperature state, right: final temperature state of the reference twin 
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Fig. 7 From left to right: reconstruction of initial (top row) and final (bottom row) state: i = 
0, 5, 10 iterations, very right: reference states 


factor of 3, and we observe a good agreement between reconstructed and reference 
temperatures. 


3.6 Matrix-Free Algorithms 


Matrix-free approaches are an essential component for ultra-high resolution finite- 
element simulations. The reason for this is twofold. The classical finite-element 
workflow of assembling the global system matrix from the local element contri- 
butions and then solving the resulting linear system of equations with algorithms 
implemented in matrix-vector fashion leads to enormous data traffic between 
main memory and computational units (CPU cores). This constitutes a severe 
performance bottleneck. Another aspect is that the cost for holding the matrix 
in main memory becomes prohibitive, too. Note that for our simulations with 
10! DoFs, see [43] and Sect.3.3 the solution vector alone requires 80 TByte of 
memory. Thus, matrix-free methods, which do not assemble the global matrix, but 
only provide the means to evaluate an associated matrix—vector product receive 
increasing attention, see e.g. [5, 20, 66-68, 78, 92]. An overview on the history 
and different approaches can e.g. be found in [9]. 

We note that the concept of hierarchical hybrid grids, from its inception on, 
was designed to be matrix-free. Based on its original premise that the macro mesh 
resolves both, geometry and material parameters, only a single discretization stencil 
had to be computed and stored for each macro primitive, as opposed to one for 
every fine mesh node, which corresponds to one row of the global matrix, [13]. 
For the case of locally varying material parameters, such as the viscosity in mantle 
convection, this could be extended, using classical local element matrix based 
approaches, see [9, 14, 37]. However, this approach does not carry over to the 
case of non-polyhedral domains and/or higher order elements. Thus, we developed 
alternative and also more efficient approaches. 
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3.6.1 Matrix-Free Approaches Based on Surrogate Polynomials 


In our HHG and HYTEG frameworks the coupling between DoFs is considered in 
the form of a stencil like in classical finite differences. Due to the local regularity of 
the refined block-structured mesh the stencil pattern is invariant across an individual 
macro primitive. For example in the case of P; elements and a scalar equation in 
3D we obtain a 15-point stencil for all nodes within a volume primitive or on a 
face primitive. However, in the case of a curved domain and/or locally variable 
material parameters the stencil weights are non-constant over a macro primitive. 
These weights can be computed on-the-fly when they are needed to apply the stencil 
locally, e.g. during a smoothing step, by standard quadrature. Note that at least for 
Pı elements this is theoretically still faster than the more sophisticated approach of 
fusing quadrature and application of the local element-matrix, see [66]. However, 
it is significantly slower than the original one-stencil-per-primitive scenario. Thus, 
in [8] we introduce a new two-scale approach that employs surrogate polynomials 
to approximate the stencil weights. The idea, in a nutshell, is to consider the 
individual stencil weights as functions on each primitive. These functions can 
be sampled, in a setup phase, on a certain number of sampling points, typically 
for all nodes of the primitive on a coarser level of the mesh hierarchy. Then 
an approximating polynomial is constructed for the weight by determining the 
polynomial coefficients through a standard least-squares fit. Whenever the stencil 
weight is needed the associated polynomial is then evaluated. Performance-wise 
this raises two questions. How much memory is required for storing the polynomial 
coefficients and what is the cost for evaluating the polynomial, as opposed to that of 
on-the-fly quadrature. The answer to the first question is that the memory footprint 
even in 3D is not too large for polynomials of moderate degree. Consider as an 
example a trivariate polynomial of degree five. It has 56 coefficients. Representing 
a 15-point stencil by such polynomials for all nodes of a volume primitive, thus, 
requires only 6720 bytes. Note also that the number of polynomials needed can be 
reduced by symmetry arguments and the zero row-sum property of consistent weak 
differential operators, see [8, 30]. Thanks to the logical structuredness of the mesh 
inside our volume and face primitives evaluation of a polynomial can be performed 
lexicographically along lines. It, thus, reduces to a 1D problem which can efficiently 
be executed using incremental updates based on the concept of divided differences, 
see [8, 30]. 

In [8] we conducted an extensive performance study of the new approach for 
the 3D Poisson problem on different curved domains which we connected to a 
polygonal domain by means of a mapping function. Thus, we are in fact solving 
a diffusion equation with a non-constant symmetric definite material tensor. As an 
example Fig. 8 shows results for a weak scaling experiment on SuperMUC. Here 
we consider a V(3,3) cycle and assign 16 macro elements to one core. On the 
finest grid there are 3.3 x 10° DoFs per volume primitive. We see that the surrogate 
operator approach (using quadratic polynomials) clearly outperforms the on-the-fly 
quadrature and is only about a factor two more expensive than the original one- 
stencil-per-primitive approach in HHG which cannot capture the domain curvature. 
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Fig. 8 Weak scaling behavior of a V(3, 3) cycle for different approaches: on-the-fly quadrature 
(IFEM), surrogate polynomials (LSQP), surrogate polynomials combined with double discret- 
ization (LSQP DD) and surrogate polynomials after node-level performance optimization (LSQP 
OPT); for comparison (CONS) gives times for the one-stencil-per-primitive approach that neglects 
curvature 


Additionally, we study the influence of the approximation on the discretization 
error and the multigrid convergence rate. The former is influenced by the ratio 
between the mesh size of the macro mesh and the solution level, the sampling 
level and also the polynomial degree, while the latter is basically non-existent, 
see Table 8. In [30] we present a theoretical analysis of the consistency error of 
the surrogate operator approach and extend our numerical experiments to more 
challenging differential operators using problems from linear elasticity and non- 
linear p-Laplacian diffusion. 

In mantle convection models viscosity may vary significantly on a local scale, 
i.e. within a volume primitive. In this case, an approximation of stencil weights by 
low-order polynomials as in [8] will no longer be sufficient. One might try higher 
order polynomials as was implemented in HYTEG for the tests presented in [30]. 
Albeit this is currently only available in 2D. Alternatively, we devised a variant 
of our approach, see [10, 11], denoted in the following as LSQP,. Here, instead 
of directly approximating the stencil weights, we replace the entries of the local 
element matrices by surrogate polynomials. The stencil weights are then computed 


Table 8 Discretization error in the L?-norm for IFEM and LSQP (with different polynomial 
degree q) and ratio of the asymptotic multigrid convergence rates for curved pipe domain (level 
£ = Orefers to an already twice refined mesh) 


tal sees 
t=? 1.86 x 10- [1.00 
eza [asais [467x 10 fio [4.64x 10-° | 1.00 
eza [ii6x 10 [raxo fio ‘[1.16x10- [1o 
e=s [289x107 [987x107 [10 [310x107 |100 
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Table 9 Weak scaling of the LSQP. approach using ca. 2.3 x 10’ DoFs per core; shown 
are: average run-time w/ and w/o coarse grid solver (c.g.) for one UMG cycle and no. of 
UMG iterations; values in brackets give no. of c.g. iterations (preconditioner/MINRES); parallel 
efficiency w.r.t. one UMG cycle is given for timings w/ and w/o c.g.; additionally the average time 
for a single residual application on the finest level is given 


Global # UMG Time UMG cycle | Parallel Time 
Islands | Cores | DoFs resolution | V-cycles w/ c.g | w/oc.g. | efficiency | residual 
1 5580 | 1.3 x 10!! | 3.4km 7 (50/150) | 192s |164s 1.00 / 1.00 | 11.9s 
2 12,000 | 2.7 x 10!! [2.8km |10 (100/150) | 213s | 169s 0.90 / 0.97 | 12.1 s 
4 21,600 | 4.8 x 10!! | 2.3 km 7 (50/250) |210s |172s 0.92 / 0.96 | 12.7 s 
8 47,250 | 1.1 x 10! | 1.7km 8 (50/350) | 230s | 173s 0.83 / 0.95 | 12.8 s 


from these surrogate element matrices in the standard fashion for the P; case. This 
variant was used to perform the dynamic topography simulations of Sect. 3.10. As 
an example Table 9 shows weak scaling results on SuperMUC Phase 1 for a mantle 
convection model with a temperature-dependent viscosity and a global resolution of 
1.7km for the largest scenario. Run-times are for a V-cycle with an Uzawa smoother, 
stopping criterion is a residual reduction of five orders of magnitude. For complete 
details we refer to [11]. Note the parallel efficiency of 83% for the largest run, 
which could be further improved by using a more sophisticated coarse grid solver, 
see Sect. 3.3, than the one available at the time of the experiment. 


3.6.2 A Stencil Scaling Approach for Accelerating Matrix-Free Finite 
Element Implementations 


In [9] we present a novel approach to fast on-the-fly low order finite element 
assembly for scalar elliptic partial differential equations of Darcy type with vari- 
able coefficients optimized for matrix-free implementations. In this approach, we 
introduce a new operator that is obtained by scaling the reference operator, i.e. the 
stencil obtained from the constant coefficient case. Assuming sufficient regularity, 
an a priori analysis showed that solutions obtained by this approach are unique and 
have asymptotically optimal order convergence in the H!-norm and the L?-norm 
on hierarchical hybrid grids. These preliminary considerations motivate our novel 
approach to reduce the cost by recomputing the surrogate stencil entries for a matrix- 
free solver in a more efficient way. 

To demonstrate the advantages of our novel scaling approach we consider, among 
other examples, a Poisson problem on a domain defined through a blending function. 
Particularly, we consider a half cylinder mantle with inner radius rı = 0.8 and outer 
radius r2 = 1.0, height zı = 4.0 and with an angular coordinate between 0 and z as 
our physical domain Qppy. The cylinder mantle is additionally warped inwards by 
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w(z) = 0.2 sin (z7/z1) in axial direction. The mapping ® : Qppy > is given by 


Vx? + y? + w(z) 
P(x, y, z) = | arccos (*//x2+y7) 
z 


with the reference domain Q = (r1, r2) x (0, 7x) x (0, z1). It follows for the mapping 
tensor K of the Poisson problem that 


1 2 / 

(D®)(D®)™ - f w'(z) +1 0 w' (z) 
= idet Do] = xe + y (0) 1/œ?+y>) 0 
w’ (z) 0 1 


Obviously, this tensor is symmetric and positive definite. In addition to the geometry 
blending, we use a variable material parameter a(x, y, z) = 1 + z. This yields the 
following PDE on the reference domain Q: —div(aK Vu) = f. Additionally, the 
manufactured solution is chosen as 


u(x, y, Zz) = sin ( ai n) cos (4y) exp (¢/2) . 
2-1) 

For our numerical experiments, we employ a macro mesh composed of 9540 
hexahedral blocks, where each block is further split into six tetrahedral elements. 
The resulting system is solved on SuperMUC Phase 2 using 14,310 compute cores, 
i.e., four macro elements are assigned per core. The largest system involves solving a 
linear system with o(10! 1) DoFs. We employ a multigrid solver with a V(3,3) cycle 
and the iterations are stopped when the residual has been reduced by a factor of 
1078. In Table 10, we report on the resulting discretization error and its estimated 
order of convergence (eoc), the asymptotic multigrid convergence rate p, and the 
time-to-solution for different refinement levels £. Note that we restricted the scaling 


approach in this test to volume and face primitives, which contain the vast majority 
of DoFs. 


Table 10 Results for large scale 3D application with errors measured in the discrete L?-norm 


Nodal integration _ Scale Vol+Face i Rel. 
DoFs Error eoc |p tts [s] | Error eoc |p tts [s] | tts 
4.7 x 10° |2.43 x 1074 |- 0.522 2.5 |2.38 x 1074 |- 0.522 | 2.0 | 0.80 


3.8 x 107 [6.00 x 10-5 |2.02 |0.536 | 4.2 |5.86 x 1075 | 2.02 [0.536 | 2.6 |0.61 
3.1 x 108 [1.49 x 10-5 |2.01 |0.539 | 12.0 [1.46 x 1075 |2.01 |0.539 | 4.5 [0.37 
2.5 x 10? |3.72 x 1076 |2.00 |0.538 | 53.9 |3.63 x 1076 | 2.00 | 0.538 | 15.3 | 0.28 
2.0 x 10!° |9.28 x 1077 |2.00 | 0.536 | 307.2 |9.06 x 10-7 |2.00 |0.536 | 88.9 |0.29 
1.6 x 10!! |2.32 x 1077 | 2.00 | 0.534 | 1822.2 | 2.26 x 1077 |2.00 | 0.534 | 589.6 | 0.32 
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These results demonstrate that the new scaling approach reproduces the dis- 
cretization error, as is expected from our variational crime analysis [8, 9, 30]. 
Additionally, the multigrid convergence rate is not affected. For larger £ the run- 
time as compared to the nodal integration approach requires only about 30% of the 
time. 


3.6.3 Stencil Scaling for Vector- Valued PDEs with Applications to 
Generalized Newtonian Fluids 


Our target problem is vector-valued, thus, we expanded the scalar stencil scaling 
idea from Sect. 3.6.2 and developed a similar matrix-free approach for vector- 
valued PDEs [31]. The construction is again based on the use of hierarchical hybrid 
grids, the conceptual basis in the HHG and HYTEG [63] frameworks. Vector-valued 
second-order elliptic PDEs play an important role in mathematical modeling and 
arise e.g. in problems from elastostatics and fluid dynamics. Numerical experiments 
indicated that the idea of the scalar stencil scaling (denoted below as unphysical 
scaling) cannot directly be applied to these equations, because the standard finite- 
element solution cannot be reproduced, even in the case of linear coefficients. Thus, 
there is need of a modified stencil scaling method (denoted below as physical scal- 
ing) that is also suited for matrix-free finite element implementations on hierarchical 
hybrid grids. It turns out this vector-valued scaling requires computation of an 
additional correction term. While this makes it more complicated and expensive, 
compared to the scalar stencil scaling, it is able to reproduce the standard finite- 
element solutions, while requiring only a fraction of the time to obtain them. In the 
best scenario, we could observe a speedup of about 122% compared to standard 
on-the-fly integration. Our largest example involved solving a Stokes problem with 
12,288 compute cores. 

One of the examples studied is the scenario of a non-linear incompressible Stokes 
problem where the fluid is assumed to be of generalized Newtonian type, modeled 
by a shear-thinning Carreau model 

z 
1) = Noo + (10 = Noo) (1 + elewl?) 
The parameters employed are given in Table 11 in dimensionless form. The values 
stem from experimental results, cf. [34, Chapter IT]. 
The computational domain Q is depicted in Fig.9. The channel has a length 


of 5 and a depth and height of 1. The kink starts at position x = 1.5. The 
domain is discretized by 14,208 tetrahedra on the coarsest level. The boundary 


Table 11 Dimensionless parameters for the Carreau viscosity model 


Parameter no Noo K r 
Value 140.764 1.0 212.2 —0.325 
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Fig. 9 Velocity streamlines for the non-linear generalized Newtonian Stokes problem 


dQ is composed into Dirichlet and Neumann parts, i.e., 92 = Ip U Ty with 
Tp = {(x, y,z) E€ 0Q | x < 5} and Ty = 0Q\[p. The volume force term f, 
the external forces t, and the Dirichlet boundary term g are set to 


A 


f = 0,0, 100)", ê= (0,0,0)", and g = 16 y(1— y)z(1—z)- (1,0, 1)". 


We solve this non-linear system by applying an inexact fixed-point iteration where 
the underlying linear systems are only solved approximately to prevent over-solving. 

In order to solve the systems, we employ the inexact Uzawa solver presented 
from [29] with variable V (3, 3) cycles where 2 smoothing steps are added to each 
coarser refinement level which enforces convergence of the method. 

In Table 12, we present the relative errors of the physical and unphysical scaling 
approaches along the line 6 = [0,5] x {0.5} x {0.42} in the supremum norm, 
computed on the 4 times refined grid. It can clearly be seen that the physical scaling 
yields significantly better results with errors smaller by three orders of magnitude. 
Further experiments indicate that the unphysical scaling does not converge to the 
standard finite element solution even after additional mesh refinements. Therefore, 
we conclude that the unphysical scaling is an inconsistent discretization for vector- 
valued problems. 

Table 13 presents the relative time-to-solutions for the nodal integration and 
physical scaling measured on SuperMUC Phase 2. On the finest level, after 6 
refinements, the relative time-to-solution of the physical scaling is at about 81%. 
Figure 9 shows the streamlines of the velocity within the computational domain 
computed with the physical scaling. 


Table 12 Relative errors 
along the line 6 in the 
supremum norm for different 
scaling approaches 


Viscosity Velocity 


Unphysical | Physical 
1.07 x 107! | 5.65 x 1075 


Physical Unphysical 


3.19 x 107? 
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Table 13 Relative 


: : , Nodal integration | Physical scaling | Relative 
time-to-solution comparison 


of the nodal integration and Dots tts [s] uss us 
physical scaling approach for 4.69 x 10° 309.10 364.97 1.18 
the non-linear generalized 3.82 x 107 | 361.90 412.10 1.14 
Stokes problem 3.08 x 108 | 895.18 719.55 0.80 
2.47 x 10° | 3227.45 2626.13 0.81 


3.7 Resilience 


Extremely concurrent simulations may in the future require a software design that is 
resilient to faults of individual software and hardware components. The probability 
of faults in the underlying systems increases with the number of parallel processes. 
Especially long-running simulations may suffer from lower mean time between 
failures and restarting applications with run-times of several hours or days consumes 
vast amounts of resources. 

Therefore, fault tolerance techniques have become a research topic as preparation 
for possibly unreliable future exascale systems. In the TERRANEO project we 
have so far mainly focused on hard faults. To cope with the failure of a core 
or node, we mainly distinguish between two categories of methods: one class 
relies on checkpointing techniques, where snapshots of the simulation are stored 
in regular intervals so that the state can be loaded upon failure. A second class 
are algorithm-based fault tolerance (ABFT) techniques, where (partially) lost data 
can be re-computed on-the-fly. Regarding large-scale simulations, checkpointing 
techniques often suffer from bad serial and parallel performance if data is written 
to disk. However, there are approaches that solely rely on distributed, in-memory 
checkpointing, which could be combined with compression techniques [71] that 
were also explored in a student project [74]. This can lead to a flexible, fast, and 
scalable resilience method [64]. 

In [53, 54] an alternative ABFT technique to provide resilience specifically for 
multigrid methods is developed. Given a faulty subdomain QF, created when a 
processor (core or node) crashes, the global solution can be recovered by solving 
a recovery problem on QF via a local multigrid iteration. After the recovery 
subproblem is solved, the global iteration continues. 

A priori, it is not clear, how many iterations will be required in the faulty 
subdomain. In [54] a fixed number of V-cycle iterations is employed. This can lead 
to under- or over-solving in Qf. While under-solving leads to a poor recovery, 
over-solving inhibits the global convergence due to wrong values at the interface. 
The convergence behavior of both scenarios is illustrated in Fig. 10. 

To address this issue, in [56] the recovery algorithm is enhanced by an adaptive 
control mechanism. Instead of solving the subproblem using a fixed number 
of iterations, a hierarchically weighted error estimator is introduced to define 
a stopping criterion for the faulty subdomain. The estimator is designed to be 
well-suited to extreme-scale parallel multigrid settings as they are employed in 
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Fig. 10 Convergence FAULT 
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applications. It guards the algorithm against over- or under-solving the subproblem 
and therefore wasting computational resources. 

As already suggested in [54], a so-called superman strategy is employed, which 
increases the computing power locally in the faulty subdomain. The recovery 
computation is accelerated e.g. through a further subdivision of Qy by means of 
additional shared memory parallelism. In Fig. 11 the development of the estimated 
error of a recovered solution is compared to the error of a fault-free scenario and a 
no-recovery scenario. Locally, the number of processes is increased by a factor of 4. 

We have also conducted studies with asynchronous recovery strategies. Here, the 
local problem on Qp and the problem on the healthy domain Q \ Qp are solved 
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concurrently. This means that during the recovery process, the multigrid iteration 
is also continued in parallel in the healthy domain. Different strategies have been 
explored in [54] which boundary conditions are suitable for the healthy domain on 
the interface. After the stopping criterion in the faulty domain is fulfilled, the local 
recovery iteration is terminated and a recoupling procedure is initiated. 

To emphasize the applicability to large-scale scenarios, the adaptive approach is 
scaled in [56] to a simulation with about 6.9 x 10!! DoFs and 245, 766 processes 
run on the JUQUEEN supercomputer. Additionally, the generality of the approach 
was demonstrated for a scenario where multiple failures in different regions of the 
domain have been triggered. 


3.8 General Performance Issues 


Simulations in geophysics require a high spatial resolution which leads to problems 
with many degrees of freedom. The success of a geophysics simulation framework 
will thus depend on its scalability and that it has only minimal memory overhead. In 
the previous sections we already presented scalability results to highlight the results 
obtained in the TERRANEO project. 

However, message passing scalability is not the only metric that is important 
in high-performance computing. Even more important is the absolute performance 
which depends critically also on the performance on a single core or a single node. 
In the extreme scale computing community the relevance of a systematic node-level 
performance engineering is increasingly being realized, see e.g. [4, 45, 48, 60, 67]. 
Note that traditional scalability is easier to achieve when the node-level performance 
is lower. Publishing only scalability results without absolute performance measures 
is a frequently observed parallel computing cheat, as exposed in [2]. We emphasize 
here that the TERRANEO project has in this sense profited from long term research 
efforts in HPC performance engineering. These stem originally from the Dime 
project* [65, 97] and have led to the excellent performance features of the HHG 
library [12, 13]. Thus, node level performance in TERRANEO is not coming as an 
afterthought imposed on existing codes, but has been an a priori design goal with 
high priority. 

Within TERRANEO, these techniques were continuously employed to analyze 
the performance and were used to guide the development of new matrix-free 
methods. This includes rather simple analysis like calculating the update cost for 
a single stencil update in terms of floating point operations per second which allows 
comparing the actual performance to the maximally achievable performance on a 
given hardware. This type of analysis has been performed in [10] and [11]. 

In [9] this metric was also extended with an analysis of the memory traffic that 
is required to apply the stencil to a function which, in terms of linear algebra, 


“http://dime.fau.de. 
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Fig. 12 Roofline model for different approaches presented in [9] 


corresponds to matrix-vector multiplication. The analysis provided a lower and 
upper bound for the performance with respect to memory. One for the optimistic 
assumption that all data required resides in the last level cache, and one for the 
pessimistic assumption that everything needs to be loaded from main memory. To 
evaluate this analysis the Intel Advisor? was used as an automatic tool to perform 
the well established Roofline analysis [58, 103]. The results are given in Fig. 12. 

In cooperation with the HPC group of the RRZE in Erlangen, a more sophis- 
ticated analysis was performed in [8, 40, 41] where the Execution-Cache-Memory 
(ECM) model [48] was used to examine the performance. 

The Roofline model assumes that only one data location is the bottleneck. This 
can be the main memory or any of the cache levels represented by the four different 
limits in Fig. 12. The ECM model however, takes into account memory, all levels 
of cache, and finally the registers within the CPU. The model helped to design 
the new computational kernels by identifying performance bottlenecks and guiding 
subsequent performance optimization steps. The end result of this development 
procedure are kernels where the analytically predicted optimal duration of a stencil- 
based update is in good agreement with the measurements in the real program code. 
Here an error of only ~15% is considered acceptable. In Fig. 13 the occupancy of 
the execution units within the CPU is shown, as it was derived from the performance 
models. 


Shttps://software.intel.com/advisor. 
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Fig. 13 Duration of the different phases involved during eight stencil-based updates as obtained 
by the ECM model for the newly developed kernel in [8] 


3.9 HyTeG 


We successfully demonstrated that the combination of a fully unstructured grid 
with regular refinement and matrix-free algorithms can be used to solve large 
geodynamical problems. A substantial amount of work in the TERRANEO project 
was performed using the HHG prototype. Even though the latter shows excellent 
performance and scalability, certain issues need to be faced. Due to the fact that 
the codebase was created over a decade ago, the coding standard cannot live up 
to current requirements. This makes it hard to maintain the framework and to 
familiarise new users with it, which is essential for a community code. Furthermore, 
the prototype was never meant to be used for higher-order discretizations or other 
than nodal elements. 

Therefore a redesign is initiated under a new name: Hybrid Tetrahedral Grids 
(HYTEG) [63]. In addition to building on the knowledge gathered over the years 
from developing the HHG framework we could also utilize ideas and infrastructure 
from the WALBERLA framework [44]. One of the fundamental changes in HYTEG 
is the strict separation of the data structures that define the macro mesh geometry 
and the actual simulation data. As in HHG, the tetrahedra of the unstructured macro 
mesh are split into their geometric primitives, namely volumes, faces, edges and 
vertices. The lower dimensional primitives (faces, edges and vertices) are used to 
decouple the volume primitives in terms of communication. Contrary to HHG the 
parallel partitioning is not solely based on volume primitives, instead all primitives 
get partitioned between processes. This e.g. allows to take also the computational 
and memory footprint for face primitives into account for load balancing, for which 
we can employ sophisticated tools such as ParMetis [61]. Note that in HyTEG 
the partitioning does not involve global data structures, an essential aspect when 
entering the exascale era. Like with HHG, it was decided to use C++ as the 
primary programming language due to its high performance and spread in the HPC 
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Fig. 14 Basic types can be combined to create various finite element discretizations 


community. Up to this point, no compute data are associated with the primitives. 
In a second step, these can be attached to the primitives. This might be a single 
value for only some of the primitives or a full hierarchy of grids for every single 
primitive. This separation also allows for efficient light-weight static and dynamic 
load balancing techniques similar to those described in [32, 96]. 

As mentioned above, one major goal when moving from HHG to HYTEG is 
the realization of different discretizations on hierarchical hybrid grids. In addition 
to unknowns located at the vertices of each micro-element of the mesh, unknowns 
on the edges, faces and inside the elements are also supported. These basic types 
of unknowns are implemented in such a way that different discretizations can be 
realized by combination which is illustrated in Fig. 14. Using this technique, a 
Taylor-Hood discretization for a Stokes flow simulation can be realized as well as 
finite volume discretizations to simulate energy transport [63]. 

By introducing different types of discretizations, there are also many more 
compute-intensive kernels that need to be taken care of. The solution chosen 
in HYTEG to solve this problem is code generation. Inspired by the work in 
ExaStencils [76, 94], our joint collaboration in this direction [15], and also using 
experience from pystencils® this can be efficiently realized. One difference to 
other projects that use whole program generation, however, is that only compute 
intensive kernels are generated, but not the surrounding data structures. In contrast 
to HYTEG itself, pystencils is using Python, which allows for much more flexibility 
and metaprogramming capabilities. The generated kernels, however are in C++, 
which eliminates the need for a Python environment when running HYTEG on 
supercomputers. 


Shttps://il Ogit.cs.fau.de/pycodegen/pystencils. 
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Fig. 15 Each combination of one leaf of every color represents one possible configuration 


Another important feature that was introduced with the community code in 
mind is a state-of-the-art continuous integration process. With every change in the 
codebase, an automated process is started, which ensures the compatibility with 
different hardware and software configurations. Additionally, extensive testing is 
also part of the pipeline to guarantee the correctness of all features inside HYTEG. 
To illustrate the complexity of this process Fig. 15 shows all possible combinations 
of configurations. 


3.10 Asthenosphere Velocities and Dynamic Topography 


In order to demonstrate the flexibility of the different algorithmic components, we 
have studied selected application questions. In [102] we investigated the question 
of flow speeds in the asthenosphere. The latter is a mechanically weak layer in the 
upper mantle right below the lithosphere, the rigid outermost shell of our planet. It 
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plays an important role in convection modelling as it distinguishes itself from the 
lower parts of the mantle by a significantly lower viscosity. The precise details of the 
contrast are, however, unknown. A variety of geologic observations and geodynamic 
arguments indicates that velocities in the upper mantle may exceed those of tectonic 
plates by an order of magnitude [49, 51]. The framework was used to simulate high- 
resolution whole earth convection models with asthenospheric channels of varying 
thickness. Reduction of the asthenospheric thickness is balanced by an associated 
reduction in its viscosity, following the Cathles parameter [88]. This resulted for the 
tested end-member case in a set-up with an asthenosphere channel of only 100 km 
depth and a significant viscosity contrast of up to 4 orders of magnitude relative 
to the deeper mantle. We found a velocity increase by a factor of 10 between a 
reference case with an upper mantle of 1000 km depth and the very thin channel end- 
member case, translating into speeds of ~ 20 cm/a within the narrow asthenosphere. 
Our suggested and numerically verified Poiseuille flow model predicts that the upper 
mantle velocity scales with the inverse of the asthenospheric thickness. 

Note that the prototype implemented in HHG already allows to include real- 
world geophysical data. The model presented in Fig. 16, e.g., uses present-day plate 
velocities, see [83], as surface boundary conditions and a buoyancy term based on 
present day temperature and density fields, converted from a seismic tomography 
model via techniques described in [27]. Viscosity depends on temperature and 
includes a jump at the bottom of the asthenosphere, chosen at 660 km. 

Another geodynamic quantity we studied is dynamic topography. This term, 
which plays a crucial role in our understanding of long term sea-level variations, 
refers to deflections of the Earth’s surface as a response to viscous stresses in the 
mantle, and has been known from geodynamic arguments for a long time [47, 87]. 
The precise magnitude of dynamic topography is still debated, however, due to 
uncertainties in measuring it [18]. Dynamic topography occurs both at the surface 


Fig. 16 Mantle convection model computed in HHG with real-world geophysical data: tempera- 
ture field from seismic tomography for viscosity (left) and resulting flow speeds (right); quantities 
are non-dimensional 
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Fig. 17 Surface dynamic topography for cases: A—viscosity with only radial variations, B— 
with the addition of viscosity variations due to varying thickness of the lithosphere, C—with 
additional temperature-dependent viscosity in the lower mantle (left) and pairwise difference 
between scenarios (right) 


and the core-mantle boundary (CMB), see [26] for a review. In [10] we studied 
the influence of lateral viscosity variations on the amplitude and pattern of dynamic 
topography. We first considered a standard benchmark with a purely radially varying 
viscosity profile, see [23, 105], for which a semi-analytic solution exists in terms 
of the propagator matrix technique [47, 85, 87]. This allowed us to verify suitable 
correctness of our LSQP, approach from Sect. 3.6. We then compared models 
with increasing levels of complexity, see Fig. 17, going up to a profile with lateral 
viscosity variations due to varying thickness of the lithosphere and temperature- 
dependent variations in the lower mantle. Full details and a discussion of the 
outcomes can be found in [10]. 
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4 Conclusions and Future Work 


TERRANEO is a project in Computational Science and Engineering [91]. We suc- 
cessfully developed innovative scientific computing methods for the exascale era. 
To this end, new algorithms were devised for simulating Earth Mantle convection. 
We analyzed the accuracy of these methods and their cost on massively parallel 
computers, showing that our methods exhibit excellent performance. A highly 
optimized prototype implementation was designed using the HHG multigrid library. 
Based on HHG, we demonstrated computations with an unprecedented resolution of 
the Earth’s Mantle based on real-world geophyical input data. 

In TERRANEO many new methods for future exascale geodynamic simulations 
were invented: they include new solver components for monolithic multigrid 
methods for saddle point problems, specially designed smoothers and new strategies 
to solve the coarse grid problems. Two new classes of matrix-free methods were 
introduced and analyzed, one based on surrogate polynomials, the other one based 
on stencil scaling. New scheduling algorithms for parallel multilevel Monte Carlo 
methods and uncertainty quantification were studied. Methods for fault tolerance 
were investigated. This includes methods based on in-memory checkpointing as 
well as new methods for fast algorithmic reconstruction of lost data when hard faults 
occur. Inverse problems in mantle convection were studied using adjoint techniques. 

Careful parallel performance studies complement our work and assess the 
suitability of the algorithmic components on future exascale computers. Many of 
the methods were tested on application oriented problems, such as for example a 
geophysics study on the relation of the thickness of the asthenosphere and upper 
mantle velocities and the influence of viscosity variations on dynamic topography. 

We found that conventional C++-based implementation techniques lack expres- 
siveness and flexibility for massively parallel and highly optimized parallel codes 
and that achieving performance portability is a major difficulty. Starting from this 
insight, we invented new programming techniques based on automatic program 
generation, learning from neighboring SPPEXA projects such as ExaStencils. These 
ideas have already been realized in HYTEG and WALBERLA as new and innovative 
simulation software architectures for multiphysics exascale computing. 

Thus, some aspects of our research in TERRANEO are of mathematical nature, 
others fall into the field of computer science. Our methods have also already 
been used to perform research in geophysics, the target discipline of TERRANEO. 
However, we point out that the research contribution of TERRANEO falls neither 
into the intersection of all these fields, nor can our contribution be understood from 
the viewpoint of mathematics or computer science alone. It is also no project in the 
geosciences, since its primary goal is not the creation of new geophysical insight. 
The goal of TERRANEO is the construction and analysis of simulation methods that 
are the enabling technologies for future exascale research in geodynamics. 

This result could not be reached by either of the classical disciplines alone. The 
synthesis of knowledge and methods from mathematics, computer science, and 
geophysics becomes more than the sum of its parts. We emphasize additionally 
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that our research in computational science and engineering is not restricted to the 
application in geophysics. Many of the innovations developed in TERRANEO can 
be transferred to other target disciplines. 

With our new computational methods, we have broken previously existing 
barriers to computational performance. This is demonstrated for example by our 
solution of indefinite linear systems that have in excess of 10! degrees of freedom. 
To our knowledge this constitutes the largest finite element computation published 
to date, even though the computation was still performed on JUQUEEN, a machine 
that is now outdated and was already retired. 

TERRANEO funding in the last period was reduced from the desired four to 
only three positions so that the new software design and its development could 
not be completed as originally proposed. A preliminary version of the TERRANEO 
software will be made public and will contain essential parts of the promised core 
functionality, but it still lacks most of the application oriented functionality that 
would make it fully usable as a Geophysics community code. Efforts will be made 
to continue the development so that the central research goal of TERRANEO can be 
reached. 
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